SYNTHESIS NOTE

Why does chain of thought accuracy eventually decline with length?

Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques
How should we allocate compute budget at inference time?

The "longer is better" assumption for CoT has an empirical ceiling: task accuracy initially improves with CoT length, reaches a peak, then decreases. The inverted-U curve applies across models and tasks, and its peak location follows consistent patterns.
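One way to check for the inverted-U on your own evaluation logs is to bucket chains by length and look at accuracy per bucket. The following is a minimal sketch, not the method used by the sources of this note: the record format `(num_tokens, is_correct)`, the function names, and the bucket size are all assumptions, and the sample records are made-up toy data.

```python
from collections import defaultdict

def accuracy_by_length(records, bucket_size=100):
    """Group (num_tokens, is_correct) records into length buckets.

    Returns {bucket_start: accuracy}, ordered by bucket_start.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket_start -> [num_correct, num_total]
    for num_tokens, is_correct in records:
        bucket = (num_tokens // bucket_size) * bucket_size
        totals[bucket][0] += int(is_correct)
        totals[bucket][1] += 1
    return {b: correct / total for b, (correct, total) in sorted(totals.items())}

def peak_bucket(curve):
    """Return the bucket_start with the highest accuracy."""
    return max(curve, key=curve.get)

# Made-up toy records, for illustration only.
toy_records = [(50, False), (60, True), (150, True), (160, True), (250, False), (260, True)]
curve = accuracy_by_length(toy_records)
print(curve)               # {0: 0.5, 100: 1.0, 200: 0.5}
print(peak_bucket(curve))  # 100
```

With real logs, buckets holding only a few chains give noisy accuracy estimates, so the peak is best read off after merging sparse buckets.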

Two scaling laws for optimal CoT length:

  1. Difficulty scaling — optimal length increases with task difficulty. Harder problems benefit from longer chains because more decomposition steps are needed. This part matches intuition.

  2. Capability scaling — optimal length decreases with model capability. More capable models find more efficient paths to correct answers and require fewer steps. Using the same long chains for a more capable model is counterproductive.
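The two laws can be reproduced in a toy model. The sketch below is purely illustrative and is not taken from the sources of this note. It assumes that a task of size `difficulty` is split evenly over `n_steps`, that a step is more likely to fail when it carries more work relative to `capability`, and that every step also carries a fixed `step_noise` chance of derailing the chain. All names and functional forms are hypothetical.

```python
import math

def toy_accuracy(n_steps, difficulty, capability, step_noise=0.02):
    """Toy accuracy of a chain with n_steps steps (illustrative model only)."""
    load = difficulty / n_steps                       # work each step must absorb
    step_ok = math.exp(-(load / capability) ** 2)     # coarse steps fail more often
    return (step_ok * (1.0 - step_noise)) ** n_steps  # every step must succeed

def optimal_steps(difficulty, capability, step_noise=0.02, max_steps=200):
    """Chain length in 1..max_steps that maximizes toy_accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability, step_noise),
    )

# Harder task, same model: the optimum moves to longer chains.
print(optimal_steps(4, 2), optimal_steps(8, 2))
# Same task, more capable model: the optimum moves to shorter chains.
print(optimal_steps(8, 2), optimal_steps(8, 4))
```

Under these assumptions the log-accuracy is -T^2 / (N C^2) - N s, with T the difficulty, C the capability, N the number of steps, and s = -ln(1 - step_noise). It peaks near N = T / (C sqrt(s)), so the optimum grows with difficulty and shrinks with capability. The real curves need not follow this form.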

The second law has a practical consequence: treating all models identically (same token budget, same chain length) misallocates compute. A model that can solve a problem in 5 steps should not be given budgets designed for a 20-step solution.
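One possible way to avoid a shared budget is to calibrate it per model from that model's own correct chains on a validation set. This is a hedged sketch, not a procedure described in the sources: `calibrate_budget`, the `quantile` and `headroom` parameters, and the sample lengths are hypothetical.

```python
import math

def calibrate_budget(correct_lengths, quantile=0.9, headroom=1.2):
    """Suggest a per-model length budget from the lengths of its correct chains.

    Takes the given quantile of the observed lengths and adds some headroom.
    """
    if not correct_lengths:
        raise ValueError("need at least one correct chain to calibrate")
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(correct_lengths)
    index = max(0, min(len(ordered) - 1, math.ceil(quantile * len(ordered)) - 1))
    return math.ceil(ordered[index] * headroom)

# Made-up step counts for two hypothetical models on the same task set.
strong_model_lengths = [5, 5, 6, 7, 8]
weak_model_lengths = [18, 20, 22, 25, 30]
print(calibrate_budget(strong_model_lengths))
print(calibrate_budget(weak_model_lengths))
```

The point of the design is that the budget is derived from each model's own behavior, so a more capable model automatically receives a smaller one.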

Simplicity bias as a training-emergent property: RL training reveals this dynamic in action. As RL training improves accuracy, models gravitate toward shorter CoTs — not because they were explicitly trained to be concise, but because shorter chains produce correct answers and RL rewards correct answers. The simplicity bias emerges automatically from the reward signal.

This connects to the note "Why do correct reasoning traces contain fewer tokens?", which reports the same empirical signal: correct chains tend to be the shorter ones. The inverted-U explains why: past the optimal length, extra steps accumulate decomposition errors and contextual noise (see "Do models fail worse when their own errors fill the context?").

The practical implication: train on CoTs of near-optimal length rather than maximal length, and at inference use length-aware filtering to discard excessively long chains. The simplicity bias is not a failure mode; it is a signal of genuine capability.
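Length-aware filtering at inference can be sketched as a majority vote over sampled chains, restricted to those under a length threshold. This is a minimal illustration under stated assumptions: chains are represented as `(answer, num_tokens)` pairs, `max_tokens` is a threshold chosen elsewhere, and the fallback to the shortest chain when nothing survives is a design choice of this sketch, not something the sources specify.

```python
from collections import Counter

def length_filtered_vote(chains, max_tokens):
    """Majority vote over the answers of chains no longer than max_tokens.

    chains: list of (answer, num_tokens) pairs.
    Falls back to the shortest chain's answer if every chain is too long.
    """
    if not chains:
        raise ValueError("need at least one sampled chain")
    kept = [answer for answer, num_tokens in chains if num_tokens <= max_tokens]
    if not kept:
        shortest_answer, _ = min(chains, key=lambda chain: chain[1])
        kept = [shortest_answer]
    return Counter(kept).most_common(1)[0][0]

# Made-up samples: the long chains outvote the short ones without filtering.
samples = [("42", 120), ("42", 150), ("17", 900), ("17", 950), ("17", 1000)]
print(length_filtered_vote(samples, max_tokens=400))   # 42
print(length_filtered_vote(samples, max_tokens=2000))  # 17
```

The same filter can be applied when selecting training data, keeping chains near the optimal length rather than the longest available ones.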

Original note title

optimal cot length follows an inverted-u — more capable models prefer shorter cot