SYNTHESIS NOTE

Does more thinking time always improve reasoning accuracy?

Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The prevailing assumption that "more thinking tokens = better reasoning" is empirically false beyond a critical point. Pushing the average thinking token count from ~1,100 to ~15,980 reduced accuracy from 87.3% to 70.3% on the same benchmark.

This non-monotonic relationship — initial improvement followed by steady decline — is consistent across multiple tasks and datasets. The researchers call the degradation phase "overthinking," and it has been largely invisible in prior work because most studies only reported the improving phase of the curve.

The practical implication: there is a sweet spot, and token budgets above it actively harm performance. The current practice of using "more tokens" as a proxy for "more reasoning" is not just wasteful; it is counterproductive past the threshold. And as the related note "Does extended thinking actually improve reasoning or just increase variance?" suggests, the gains before the threshold aren't even what they appear to be.
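A minimal sketch of how the sweet spot could be read off a budget sweep, assuming one has measured accuracy at several average thinking-token counts. The function names are hypothetical, and the only data used are the two operating points quoted in this note; a real sweep would need many more points to locate the peak.

```python
def best_budget(sweep):
    """Return the (avg_thinking_tokens, accuracy) pair with the highest accuracy."""
    return max(sweep, key=lambda point: point[1])


def accuracy_drop(sweep):
    """Percentage points lost between the peak and the largest budget in the sweep."""
    peak = best_budget(sweep)
    largest = max(sweep, key=lambda point: point[0])
    return round(peak[1] - largest[1], 1)


# The two operating points quoted in this note (tokens, accuracy in %).
sweep = [(1100, 87.3), (15980, 70.3)]

print(best_budget(sweep))    # the better of the two measured budgets
print(accuracy_drop(sweep))  # points lost by pushing to the largest budget
```

Selecting the budget by measured peak rather than by a fixed cap reflects the non-monotonic curve: the largest budget is a candidate like any other, not the default.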

The bidirectional calibration failure (Between Underthinking and Overthinking): The relationship is not just non-monotonic — models miscalibrate in both directions. For easy questions, models often detect difficulty increases and extend reasoning appropriately. But for hard questions beyond their capability, models underthink — failing to recognize difficulty or lacking the knowledge to respond effectively, producing responses shorter than needed. The result: models overthink easy problems (generating unnecessarily long outputs) and underthink hard ones (failing to extend reasoning when most needed).

Length-based preference optimization provides a surprising intervention: fine-tuning to prefer shorter responses — using only unlabeled data, without ground-truth labels — maintains relatively strong accuracy while reducing token length. The reduction is disproportionately from incorrect responses (which are significantly longer), but 10-25% reduction on correct responses is also observed. This suggests models have latent ability to calibrate difficulty for easy problems but retain an overthinking tendency that preference optimization can reduce.
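One way the unlabeled, length-only signal could be turned into preference data is sketched below. This is an assumption about pair construction, not the source's recipe: `length_preference_pairs`, the whitespace token count, and the `min_gap` filter are all hypothetical, and the `prompt`/`chosen`/`rejected` keys simply follow a common convention in preference-tuning tooling.

```python
from itertools import combinations


def token_len(text):
    """Crude length proxy (assumption): whitespace-separated token count."""
    return len(text.split())


def length_preference_pairs(prompt, responses, min_gap=0.2):
    """Build preference pairs that favor the shorter of two sampled responses.

    No ground-truth labels are consulted; length is the only signal.
    Pairs whose relative length gap is below min_gap are skipped.
    """
    pairs = []
    for a, b in combinations(responses, 2):
        shorter, longer = sorted((a, b), key=token_len)
        longer_len = token_len(longer)
        if longer_len == 0:
            continue
        gap = (longer_len - token_len(shorter)) / longer_len
        if gap >= min_gap:
            pairs.append({"prompt": prompt, "chosen": shorter, "rejected": longer})
    return pairs


# Toy samples for one prompt (illustrative strings, not real model output).
samples = ["a b", "a b c d e f g h i j", "a b c"]
for pair in length_preference_pairs("toy prompt", samples):
    print(token_len(pair["chosen"]), "<", token_len(pair["rejected"]))
```

Because incorrect responses tend to be the longer ones, a length-only preference like this would mostly push down on them, which is consistent with the disproportionate reduction described above.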

PI framework: the attention-level mechanism behind the threshold: The PI (Test-time Prompt Intervention) framework provides the attention-level mechanism that explains why the threshold exists. Visualizing attention maps across reasoning steps reveals that verification and backtracking steps (e.g., steps 7-8 in a typical trace) receive minimal subsequent attention — the model generates them but barely reads them. After generating the correct answer step, all following steps predominantly attend to that pivotal moment rather than to intermediate verification. The critical steps — those whose predecessors all receive high attention — can reproduce the reasoning with 75% fewer steps. This transforms the behavioral observation (accuracy degrades with more tokens) into a mechanistic explanation: redundant tokens are attention-invisible, contributing neither signal nor structure to the final answer. The overthinking region is precisely where token generation has detached from the attention graph that actually drives outputs. Source: Prompts Prompting.
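The "generated but barely read" observation can be made concrete with a small sketch. It assumes a step-level attention matrix has already been aggregated from token-level attention, where `step_attn[i][j]` is the share of attention step `i` places on earlier step `j`. The function names, the threshold, and the toy matrix are all hypothetical and are not taken from the PI paper.

```python
def attention_received(step_attn):
    """Mean attention each step receives from all later steps.

    Returns None for the final step, which has no later steps to read it.
    """
    n = len(step_attn)
    received = []
    for j in range(n):
        later = [step_attn[i][j] for i in range(j + 1, n)]
        received.append(sum(later) / len(later) if later else None)
    return received


def low_attention_steps(step_attn, threshold=0.05):
    """Indices of steps whose mean received attention falls below threshold."""
    return [
        j
        for j, score in enumerate(attention_received(step_attn))
        if score is not None and score < threshold
    ]


# Toy 4-step trace (invented values): step 2 plays the role of a
# verification step that later steps almost never attend to.
step_attn = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.68, 0.3, 0.02, 0.0],
]

print(low_attention_steps(step_attn))
```

Steps flagged this way are candidates for pruning in the sense described above: removing them should leave the attention paths that drive the final answer largely intact.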

Optimal reasoning token ratio exists but models cannot reach it. ZebraLogic's analysis of constraint satisfaction problems shows that there exists an optimal ratio of reasoning tokens to problem complexity (measured by Z3 solver conflicts). O1-like models scale reasoning tokens with complexity and approach this optimal ratio for moderate problems, but cannot reach it when complexity is extremely high — the reasoning effort ceiling is below what the problem requires. Self-verification prompting provides only marginal improvement (31.7% → 33.0% → 32.1% on second iteration), suggesting the bottleneck is not insufficient verification but insufficient reasoning depth. The optimal ratio finding quantifies the threshold: the sweet spot is not just "not too many tokens" but a specific relationship between problem difficulty and reasoning budget.

S1-Bench (2025) reveals that LRMs can prejudge question simplicity — especially in Chinese — but thinking length does NOT shorten despite this prejudgment. Models generate unnecessary solution rounds after reaching the correct answer, repeatedly reverifying simple problems already solved. Models with longer thinking processes produce more excessive solution rounds. Furthermore, LRMs sometimes include incorrect intermediate conclusions in their reasoning even when ultimately reaching correct final answers, and sometimes reach the correct answer during reasoning but then deviate to produce incorrect final conclusions. The prejudgment finding is architecturally important: it suggests the overthinking mechanism is not caused by inability to assess difficulty, but by an inability to act on that assessment — the model "knows" the problem is simple but cannot truncate its reasoning accordingly. Source: Arxiv/Evaluations.

S1-Bench's architectural deepening: difficulty is linearly decodable from hidden states; the failure is action, not perception. The full S1-Bench study (28 LRMs across multi-domain, multilingual model-simple questions) goes beyond the prejudgment-but-no-truncation observation. Using DS-R1-1.5B and DS-R1-7B as representative cases, a single-layer MLP trained on the final-layer hidden state of the last token in the encoded question predicts question difficulty with monotonically increasing accuracy as difficulty rises. The structure is already there: implicit, linear, decodable without specialized probes. Yet behaviorally, LRMs still produce redundant solution rounds with higher average token entropy on the same questions the probe correctly classifies as easy. The authors interpret this as architectural self-doubt: the model perceives simplicity, then second-guesses its own perception, leading to exploratory generation that overrides the implicit difficulty signal. This localizes the failure to the perception-to-action interface, not to representational capacity or to difficulty assessment. The probe-vs-behavior gap is the diagnostic; it predicts that mechanistic interventions routing generation through the difficulty representation should outperform prompt-engineered "answer briefly" instructions, which target the wrong layer. Source: Reasoning Methods CoT ToT.
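The probing setup can be illustrated with a self-contained sketch. Everything here is a stand-in: synthetic 8-dimensional vectors replace real final-layer hidden states, a plain logistic-regression layer replaces the study's single-layer MLP, and the two Gaussian clusters are an assumption chosen so that "easy" and "hard" are linearly separable by construction. It shows the mechanics of a linear probe, not the S1-Bench result.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear probe logit for one state vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(states, labels, lr=0.5, epochs=200):
    """Fit a single linear layer with full-batch gradient descent on log loss."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(score(w, b, x)) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, states, labels):
    """Fraction of states whose predicted class (logit > 0) matches the label."""
    hits = sum((score(w, b, x) > 0) == (y == 1) for x, y in zip(states, labels))
    return hits / len(labels)


def synthetic_states(n_per_class, dim=8, seed=0):
    """Stand-in 'hidden states': label 0 = easy, label 1 = hard (assumption)."""
    rng = random.Random(seed)
    states, labels = [], []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            states.append([rng.gauss(center, 0.3) for _ in range(dim)])
            labels.append(label)
    return states, labels


states, labels = synthetic_states(50, seed=0)
w, b = train_probe(states, labels)
held_out_states, held_out_labels = synthetic_states(50, seed=1)
print(probe_accuracy(w, b, held_out_states, held_out_labels))
```

With real models, the probe's held-out accuracy would be compared against the model's behavior on the same questions; the diagnostic described above is the gap between what the probe can read and what the generation actually does.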


Original note title

reasoning accuracy degrades beyond a critical thinking-token threshold