Why does chain of thought accuracy eventually decline with length?
Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.
The "longer is better" assumption for CoT has an empirical ceiling: task accuracy initially improves with CoT length, reaches a peak, then declines. This inverted-U curve holds across models and tasks, and the location of its peak follows consistent patterns.
Two scaling laws for optimal CoT length:
- Difficulty scaling: optimal length increases with task difficulty. Harder problems benefit from longer chains because more decomposition steps are needed. This part matches intuition.
- Capability scaling: optimal length decreases with model capability. More capable models find more efficient paths to correct answers and require fewer steps. Using the same long chains for a more capable model is counterproductive.
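Both laws can be reproduced with a toy model. The sketch below is an illustrative assumption, not the model from the underlying paper: a task of total difficulty T is split evenly over N steps, a step fails more often when its share T/N is large relative to model capability M, and every step also carries a small fixed chance of introducing an error. The function names, the squared-load failure term, and the 2% per-step noise are all hypothetical choices made to obtain an inverted U.

```python
import math


def accuracy(n_steps: int, difficulty: float, capability: float,
             step_noise: float = 0.02) -> float:
    """Toy probability that an n-step chain ends in a correct answer."""
    # Each step takes an equal share of the task. Assumption: a step is more
    # likely to fail when its share is large relative to model capability.
    load = difficulty / (n_steps * capability)
    p_solve = math.exp(-load ** 2)
    # Assumption: every step adds a small, fixed chance of a slip.
    p_clean = 1.0 - step_noise
    # The chain is correct only if every step is solved without a slip.
    return (p_solve * p_clean) ** n_steps


def optimal_length(difficulty: float, capability: float,
                   step_noise: float = 0.02, max_steps: int = 200) -> int:
    """Chain length in 1..max_steps with the highest toy accuracy."""
    return max(range(1, max_steps + 1),
               key=lambda n: accuracy(n, difficulty, capability, step_noise))


if __name__ == "__main__":
    for difficulty, capability in [(4.0, 1.0), (8.0, 1.0), (4.0, 2.0)]:
        n_star = optimal_length(difficulty, capability)
        print(f"T={difficulty} M={capability} optimal steps={n_star}")
```

Under these assumptions the log-accuracy is -T^2 / (N M^2) - kN with k = -ln(1 - step_noise), which peaks near N = T / (M sqrt(k)). The peak therefore moves right as difficulty grows and left as capability grows, mirroring the two laws qualitatively. The specific numbers carry no empirical weight.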
The second law has a practical consequence: treating all models identically (same token budget, same chain length) misallocates compute. A model that can solve a problem in 5 steps should not be given budgets designed for a 20-step solution.
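One way to avoid a shared budget is to calibrate it per model from held-out trials. The sketch below assumes you have already logged, for each model, whether an answer was correct under each token budget; the `best_budget` helper and the trial data are hypothetical and serve only to show the selection step.

```python
from collections import defaultdict


def best_budget(trials):
    """Pick the token budget with the highest observed accuracy.

    trials: iterable of (token_budget, was_correct) pairs for ONE model.
    Ties are broken in favour of the smaller budget.
    """
    stats = defaultdict(lambda: [0, 0])  # budget -> [num_correct, num_total]
    for budget, was_correct in trials:
        stats[budget][0] += int(was_correct)
        stats[budget][1] += 1
    if not stats:
        raise ValueError("need at least one trial")
    return max(stats, key=lambda b: (stats[b][0] / stats[b][1], -b))


# Hypothetical calibration logs for two models on the same task set.
weaker_model = [(256, False), (256, False), (1024, True), (1024, True),
                (4096, True), (4096, False)]
stronger_model = [(256, True), (256, True), (1024, True), (1024, False),
                  (4096, False), (4096, False)]

print(best_budget(weaker_model), best_budget(stronger_model))
```

With real logs the per-budget sample sizes would need to be large enough for the accuracy estimates to be meaningful; the six-trial lists here are only for illustration.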
Simplicity bias as a training-emergent property: RL training reveals this dynamic in action. As RL training improves accuracy, models gravitate toward shorter CoTs — not because they were explicitly trained to be concise, but because shorter chains produce correct answers and RL rewards correct answers. The simplicity bias emerges automatically from the reward signal.
This connects to "Why do correct reasoning traces contain fewer tokens?", which reports the same empirical signal: correct chains tend to be shorter. The inverted-U explains why: past the optimal length, each additional step adds another chance of a decomposition error and more contextual noise, and these accumulate (see "Do models fail worse when their own errors fill the context?").
The practical implication: train on CoTs of optimal length rather than maximal length, and at inference use length-aware filtering to discard excessively long chains. The simplicity bias is not a failure mode; it is a signal of genuine capability.
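A minimal sketch of length-aware filtering at inference, assuming several chains have been sampled for one question and each is recorded as a (token count, final answer) pair. The cutoff rule used here, dropping chains longer than 1.5 times the median length before a majority vote, is an illustrative assumption rather than a rule taken from the source; `length_filtered_vote` and `max_ratio` are hypothetical names.

```python
from collections import Counter
from statistics import median


def length_filtered_vote(samples, max_ratio=1.5):
    """Majority vote over chains that are not excessively long.

    samples: list of (num_tokens, answer) pairs for one question.
    max_ratio: chains longer than max_ratio * median length are discarded
               (assumed rule; must be >= 1 so at least one chain survives).
    """
    if not samples:
        raise ValueError("need at least one sampled chain")
    if max_ratio < 1:
        raise ValueError("max_ratio must be at least 1")
    cutoff = max_ratio * median(n for n, _ in samples)
    kept = [answer for n, answer in samples if n <= cutoff]
    return Counter(kept).most_common(1)[0][0]


# Hypothetical samples: three short chains and three very long ones.
samples = [(120, "42"), (135, "42"), (150, "41"),
           (900, "17"), (950, "17"), (1000, "17")]

plain_majority = Counter(a for _, a in samples).most_common(1)[0][0]
print(plain_majority, length_filtered_vote(samples))
```

In this made-up example the unfiltered vote is dominated by the long chains, while the filtered vote keeps only the three short ones. Whether a median-based cutoff is a good threshold in practice would need to be checked per model and task, since the optimal length shifts with both.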
Inquiring lines that use this note as a source (228)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Does chain-of-thought text causally drive reasoning or merely reflect it?
- Why do models commit to answers early on easy versus hard tasks?
- Why does step-by-step reasoning fail when tool outputs get very large?
- Why does chain-of-thought reasoning hurt recommendation tasks specifically?
- How much does faithfulness vary naturally in reasoning without evaluation pressure?
- Does the heuristic dominance ratio vary predictably across model architectures?
- What is the relationship between reasoning depth and verbalization requirements?
- What distinguishes genuine reasoning activation from memorization-assisted answer recall?
- Why do simple length heuristics outperform sophisticated semantic methods?
- What is the critical thinking token threshold beyond which accuracy degrades?
- How do verbose and concise reasoning occupy different regions in activation space?
- Can meaning-level metrics like Semantic Entropy avoid length bias?
- Can penalizing reasoning transitions fix underthinking without fine-tuning models?
- Why does retrieval chain training unlock scaling laws in QA?
- Are correct reasoning traces measurably shorter than incorrect ones?
- Does iterative denoising order affect the reasoning style diffusion models learn?
- What makes diffusion chain-of-thought reasoning qualitatively different from sequential chain-of-thought?
- How does optimizing for accuracy during training degrade downstream reasoning quality?
- Why do contrastive reasoning approaches outperform single-path belief evaluation?
- Why do chain-of-thought prompts work if reasoning is not systematic?
- How much accuracy is preserved when removing explanatory layers from reasoning traces?
- What advantages emerge from running 13 times more parallel reasoning chains with the same budget?
- Does the timing of AI feedback relative to user reasoning change its effectiveness?
- Can marginal hints integrate better into reasoning than comprehensive explanations?
- Do explicit reasoning chains improve or harm performance on complex judgment tasks?
- Why do top performers produce shorter chains of thought in their strongest domains?
- How much does annotator style actually influence chain-of-thought prompting performance?
- How do self-revisions degrade reasoning accuracy in extended traces?
- Why do correct reasoning traces in language models tend to be shorter?
- Why do logically invalid chain-of-thought examples work nearly as well?
- Does each reasoning step in chain-of-thought introduce cumulative error?
- Can chain-of-thought faithfulness exist without causal necessity in reasoning?
- Why do correct reasoning traces appear shorter than incorrect ones?
- Can chain-of-thought explanations be both sufficient and necessary for model decisions?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- Can extended thinking genuinely improve reasoning or just increase variance?
- What happens to chain-of-thought performance across distribution shifts?
- Why do more capable models prefer shorter chains of thought?
- Can concise reasoning traces match verbose explanation accuracy?
- Can testing prior knowledge and checking understanding improve explanation outcomes?
- What makes Compound-QA expose weaknesses in monologue reasoning?
- How reliable is the top-2 confidence gap as a stopping signal across tasks?
- Do reasoning models show the same answer-maintenance pattern that diffusion models exhibit?
- Does explicit reasoning help or hurt tasks requiring continuous nuanced judgment?
- Why does fine-tuning degrade reasoning quality even as accuracy improves?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- How do gradient descent iterations at inference compare to chain-of-thought reasoning chains?
- Why do models automatically adjust reasoning length to problem difficulty?
- What determines the optimal thinking token threshold for a given task?
- Why does explicit reasoning degrade passage reranking performance?
- Why does reasoning accuracy degrade beyond a critical thinking token threshold?
- How does difficulty level change whether extended thinking provides genuine reasoning signal?
- What determines the finite chain length where robustness improvements plateau?
- Can subtask-level voting replace sequential revision for improving long-horizon task accuracy?
- Why does domain accuracy improve while reasoning quality degrades after supervised fine-tuning?
- When does explicit reasoning actually degrade performance on a task?
- Why are correct reasoning traces consistently shorter than incorrect ones?
- How should iterative research tasks limit context per reasoning turn?
- Why do reasoning models perform worse on theory of mind tasks?
- Why do diffusion LLM answer tokens converge in confidence long before reasoning stabilizes?
- How do autoregressive models constrain where chain-of-thought prompts can be positioned?
- How do chain-of-thought structures affect reasoning robustness?
- Why do temporal reasoning patterns matter more than final answers?
- What cognitive constraints limit how complex a deception can become?
- Why do simple math problems get worse with longer reasoning chains?
- How should inference budget adapt based on problem difficulty?
- How do covert thoughts differ from chain-of-thought reasoning in language models?
- Why does step-by-step reasoning degrade performance on judgment-based tasks?
- How should reasoning prompts adapt based on question complexity and type?
- What saliency patterns distinguish successful from failed chain-of-thought reasoning?
- Why does self-revision degrade reasoning accuracy in o1-like models?
- How does random walk length control reasoning complexity in question generation?
- How do graph topology properties like cyclicity and diameter affect reasoning quality?
- How does chain-of-thought training change higher layer computations?
- Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?
- Why do longer reasoning chains signal hesitation rather than depth?
- Does reasoning structure match explicit versus implicit task demands?
- How does evaluation format change what we measure about model reasoning?
- Does chain-of-thought reasoning specifically improve performance on metalinguistic tasks?
- Why does extended reasoning fail for search and knowledge retrieval tasks?
- Does chain-of-thought reasoning improve mental state tracking in dialogue?
- What structural properties define effective long chain-of-thought reasoning?
- What reasoning token threshold marks the accuracy degradation point?
- How does overthinking in early turns degrade later retrieval rounds?
- How do adversarial triggers bypass the protections of longer reasoning chains?
- Does chain-of-thought reasoning amplify bullshit or just make it more visible?
- Do shorter reasoning traces actually produce more reliable model outputs?
- How does meta-reasoning combine information distributed across multiple chains?
- How does self-revision in reasoning chains amplify confidence in wrong answers?
- Do chain-of-thought explanations reveal genuine reasoning or trigger latent features?
- Why does reasoning effort fail to improve theory of mind performance?
- Does distillation from reasoning models spread overthinking to smaller models?
- Why do correct reasoning traces tend to be shorter than incorrect ones?
- Why does extended thinking increase output variance without improving reasoning quality?
- Can parallel reasoning chains outperform longer sequential chains with the same compute?
- Why does parallel thinking outperform sequential thinking under the same token budget?
- When does sequential reasoning provide exponential advantages over parallel voting?
- What makes diverse reasoning sources more valuable than deeper single paths?
- Why does parallel thinking outperform sequential thinking with equal tokens?
- When does sequential chain-of-thought dramatically beat parallel voting approaches?
- Does deep-thinking ratio measure computational effort better than chain-of-thought length?
- Why do introverted agents produce longer and more detailed reasoning traces?
- How do insert, forget, and merge operations maintain thought coherence over time?
- Does thought consolidation address the confirmatory reflection problem in reasoning models?
- Why does inference-time thinking hurt proactive critical thinking in vanilla models?
- Why does intermediate step quality predict reasoning outcomes better than global features?
- How do exemplar properties affect the brittleness of chain-of-thought prompting?
- Why do reasoning models fail when input length increases even below context limits?
- What makes parallel thinking more efficient than sequential chains?
- What happens to reasoning accuracy when models use more thinking tokens?
- How does chain-of-thought pressure models to rationalize pattern exceptions?
- Why does chain-of-thought prompting fail to fix length-induced reasoning degradation?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- What explains the gap between perplexity performance and actual reasoning capability?
- Why does parallel thinking outperform sequential thinking under token limits?
- Which sentences in reasoning traces actually influence the final answer?
- How do longer reasoning chains create vulnerability to attacks?
- What three factors actually drive chain of thought performance improvements?
- Why do larger reasoning models show cyclicity only in later layers?
- Why do format and structure matter more than actual content in reasoning?
- Why do we measure reasoning quality by reading visible chains?
- Why does revision often make reasoning accuracy worse in frontier models?
- Why does outcome supervision fail for long reasoning chains?
- Why do chain-of-thought outputs look logical but perform rhetorically?
- How does reinforcement learning differ from chain-of-thought distillation?
- Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
- What distinguishes coherent reasoning from inaccurate but plausible predictions?
- Why does SFT reduce reasoning quality even when improving domain accuracy?
- How does chain-of-thought length affect attention to constraint tokens?
- What causes length bias in language model reward models?
- Why do longer reasoning chains correlate with lower accuracy in o1-like models?
- When are multiple independent attempts more valuable than depth?
- Does chain-of-thought prompting overcome implicit meaning deficits in text analysis?
- How does chain-of-thought reasoning become decorative after domain-specific fine-tuning?
- Does SFT degrade reasoning quality while improving domain accuracy?
- Do longer reasoning traces actually improve theory of mind accuracy?
- Does explicit reasoning help or hurt tasks requiring continuous judgment?
- Why does increasing reasoning not improve AI social reasoning performance?
- Can continuous latent reasoning match discrete chain-of-thought without training modifications?
- Can reasoning evaluation metrics reward actual reasoning instead of theater?
- Why do readability and style metrics plateau while reasoning improves with scale?
- How does extended thinking affect variance in reasoning model outputs?
- Do correct reasoning traces tend to be shorter than incorrect ones?
- When should a system choose extended thinking versus quick responses?
- How should timing for reasoning intervention be determined during inference?
- What metric distinguishes deep reasoning from superficial information propagation?
- Why do some reasoning steps receive negligible attention from later steps?
- Can models learn to stop thinking when a question lacks necessary information?
- How does chain of thought amplify specific forms of rhetorical bullshit?
- How much does chain-of-thought reasoning narrow the decompression gap?
- Does internal self-revision actually degrade reasoning accuracy in models?
- What makes a first answer so often the best answer a model produces?
- Why do benchmark scores rise while reasoning quality declines?
- Does chain-of-thought reasoning help or hurt social reasoning tasks?
- How much reasoning depth do we actually need for most real-world tasks?
- Do shorter reasoning chains maintain instruction adherence better than longer ones?
- Do reasoning failures stem from strategy or from calculation breakdown?
- Do reward reasoning models with chain-of-thought reasoning evaluate prompts better?
- What is the optimal balance between search rounds and reasoning depth per round?
- How much does extended thinking actually improve model reasoning ability?
- Does penalizing thought transitions improve reasoning without model retraining?
- Why does additional reasoning effort not improve theory of mind performance?
- Does the answer stage perform substantial reasoning beyond the thinking draft?
- Can thinking token density explain reasoning performance beyond total length?
- Can we improve reasoning by amplifying information at mutual information peaks?
- Do shorter correct reasoning traces contain more thought anchors than longer ones?
- Why do familiar patterns that support correct answers sometimes drive errors?
- How does backtracking capability address error compounding in chain-of-thought reasoning?
- Why does failed step fraction predict reasoning quality better than trace length?
- Does chain of thought reasoning faithfully reflect what a model actually believes?
- Why do correct reasoning traces stay shorter than incorrect ones?
- Why are incorrect reasoning traces longer than correct ones?
- Can minimal reasoning steps match verbose reasoning accuracy?
- What mechanisms cause reasoning models to wander rather than focus?
- Why do per-turn thinking budgets matter alongside iterative retrieval depth?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- Does more thinking always improve language model accuracy?
- Does task difficulty alone determine how many thinking tokens a model should use?
- Can layer-wise prediction stabilization identify when genuine reasoning has stopped?
- What happens to model reasoning accuracy as thinking token requirements exceed critical thresholds?
- Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?
- Why do concise reasoning chains match verbose chain-of-thought token efficiency?
- How does interaction horizon differ from chain-of-thought depth?
- Why does chain-of-thought fail to improve multimodal model perception performance?
- What happens to iterative search quality when reasoning depth is unconstrained?
- Why does reasoning volume fail to improve theory of mind performance?
- Does trace length actually reflect problem difficulty or training proximity?
- Do longer chain-of-thought traces improve interpretability or just performance?
- Are chain-of-thought traces anthropomorphizing how AI models really reason?
- Can chain-of-thought traces harm rather than help user understanding?
- What causes reasoning quality to degrade during long research tasks?
- Can benchmark improvements hide degradation of deliberative reasoning?
- Why does per-step deliberation lose global perspective compared to dynamic discovery?
- Why do smaller models lose reasoning faithfulness more than larger models?
- Why might chain-of-thought reasoning bypass action selection pathways?
- Can bounded workspaces prevent overthinking better than summarization alone?
- What makes answer equivalence sufficient to discard a reasoning path?
- Why do macro and micro forecasting scales require different reasoning approaches?
- How does predictive accuracy on future tokens differ from correctness on labeled answers?
- Does chain-of-thought accuracy degrade with longer reasoning traces?
- Why does reasoning catalyst data remain stable across multiple self-improvement iterations?
- Why does uniform averaging across all tokens dilute the reasoning signal?
- Why does parallel thinking outperform sequential thinking under fixed token budgets?
- Why do longer reasoning chains explore like tourists instead of scientists?
- What causes reward models to favor length and sycophancy?
- Does reasoning style transfer matter more than solution correctness in distillation?
- How does self-distillation degrade reasoning by suppressing uncertainty signals?
- Why do shorter confident reasoning traces fail on out-of-distribution problems?
- Why do thinking models execute longer tasks than standard language models?
- How do frontier models maintain agreement scores above 90 percent across reasoning tasks?
- What distinguishes metacognitive regulation from standard chain-of-thought reasoning?
- Why does SFT fail when expert demonstrations are too long for small models?
- Why does target probability matter more than task logical complexity?
- What computational structures can actually scale serial reasoning depth?
- What makes o1's chain-of-thought processing specifically effective for exploration tasks?
- How does confidence filtering improve selection of reasoning traces?
- Can models learn to optimize their own chain-of-thought generation?
- What makes some bottlenecks invisible to chain-of-thought training?
- Why does chain-of-thought work for math but fail for grounding?
- How does latent reasoning recursion compare to chain-of-thought reasoning?
- Why does exemplar performance vary across order complexity diversity and style?
- Why are shorter reasoning traces more reliable than longer correct ones?
- How brittle are chain-of-thought exemplars across order and complexity?
- Why does reasoning backward enable better forward reasoning performance?
- How much of chain-of-thought reasoning actually diverges from the final answer?
- What makes multi-turn critique trajectories more effective than single-turn reasoning chains?
- How does question difficulty and breadth affect what models learn to reason?
Related concepts in this collection (6)
- Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
the empirical observation; this note provides the theoretical model explaining it
- Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
the threshold is not fixed: this note shows it's a function of task difficulty and model capability
- Does extended thinking actually improve reasoning or just increase variance?
When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
past the optimal length, variance inflation dominates over quality improvement
- Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
parallel approaches avoid the problem by distributing tokens across independent chains rather than extending one chain past its optimum
- Can minimal reasoning chains match full explanations?
Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
empirical operationalization: Chain of Draft (CoD) demonstrates that capable models can achieve full accuracy at 7.6% of standard CoT length, matching the inverted-U prediction that more capable models prefer dramatically shorter chains. On this reading, the removed 92.4% of tokens sat on the declining side of the curve
- Does gradually tightening token budgets beat fixed budget training?
Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
training-time implementation: the generous-to-tight curriculum naturally navigates the inverted-U by allowing exploration of the full curve during early training then compressing to the optimal point; models discover the peak with generous budgets and descend toward conciseness under tightening constraints
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- Break the Chain: Large Language Models Can be Shortcut Reasoners
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- Understanding and Mitigating Premature Confidence for Better LLM Reasoning
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Original note title
optimal cot length follows an inverted-u — more capable models prefer shorter cot