SYNTHESIS NOTE

Why does chain of thought accuracy eventually decline with length?

Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

The "longer is better" assumption for CoT has an empirical ceiling: task accuracy initially improves with CoT length, reaches a peak, then decreases. The inverted-U curve applies across models and tasks, and its peak location follows consistent patterns.

Two scaling laws for optimal CoT length:

Difficulty scaling — optimal length increases with task difficulty. Harder problems benefit from longer chains because more decomposition steps are needed. This part matches intuition.
Capability scaling — optimal length decreases with model capability. More capable models find more efficient paths to correct answers and require fewer steps. Using the same long chains for a more capable model is counterproductive.

The second law has a practical consequence: treating all models identically (same token budget, same chain length) misallocates compute. A model that can solve a problem in 5 steps should not be given budgets designed for a 20-step solution.
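Both laws, and the inverted-U itself, can be reproduced in a toy model. The sketch below is an illustration only, not the model used in the sourced work. It assumes that a chain of n steps splits a task of fixed difficulty evenly across the steps, that a step is harder to get right when its share of the work is large relative to model capability, and that every step carries a small independent chance of a slip. The names toy_accuracy and optimal_length and the parameters difficulty, capability, and step_noise are all hypothetical.

```python
import math

def toy_accuracy(n_steps, difficulty, capability, step_noise=0.02):
    """Toy probability that an n-step chain ends in a correct answer."""
    # Assumption: each step carries difficulty / n_steps units of work, and a
    # step is harder to solve when that load is large relative to capability.
    load = difficulty / (n_steps * capability)
    p_step_solved = math.exp(-load ** 2)
    # Assumption: every step also risks a small independent slip, so noise
    # accumulates as the chain gets longer.
    p_step_clean = 1.0 - step_noise
    return (p_step_solved * p_step_clean) ** n_steps

def optimal_length(difficulty, capability, step_noise=0.02, max_steps=500):
    """Chain length (in steps) that maximizes toy_accuracy, by brute force."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability, step_noise),
    )

if __name__ == "__main__":
    for difficulty, capability in [(10, 2), (20, 2), (10, 4)]:
        best = optimal_length(difficulty, capability)
        print(f"difficulty={difficulty} capability={capability} best_steps={best}")
```

In this toy, too few steps overload each step, while too many steps let slips accumulate, so accuracy peaks in between. Under these assumptions the peak moves right as difficulty grows and left as capability grows, which is the qualitative pattern the two laws describe. The specific numbers mean nothing beyond that.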

Simplicity bias as a training-emergent property: RL training reveals this dynamic in action. As RL training improves accuracy, models gravitate toward shorter CoTs — not because they were explicitly trained to be concise, but because shorter chains produce correct answers and RL rewards correct answers. The simplicity bias emerges automatically from the reward signal.

This connects to "Why do correct reasoning traces contain fewer tokens?", which reports the same empirical signal: correct chains tend to be the shorter ones. The inverted-U explains why: length past the optimal point accumulates decomposition errors and contextual noise (see "Do models fail worse when their own errors fill the context?").

The practical implication: train on optimal-length CoTs rather than maximal-length ones, and at inference use length-aware filtering to discard excessively long chains. The simplicity bias is not a failure mode; it is a signal of genuine capability.
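One way to picture length-aware filtering at inference is to sample several chains, drop those that are much longer than is typical for the question, and vote over the rest. The sketch below is a minimal version under stated assumptions: the cutoff rule (a multiple of the median length) and the whitespace word count standing in for a token count are illustrative choices, not the filter described in the sourced work, and length_aware_vote and max_ratio are hypothetical names.

```python
from collections import Counter
from statistics import median

def length_aware_vote(chains, max_ratio=1.5):
    """Majority answer among sampled chains that are not excessively long.

    chains: list of (reasoning_text, answer) pairs for one question.
    max_ratio: assumed cutoff as a multiple of the median length; keep it
        at 1.0 or above so at least one chain always survives.
    """
    if not chains:
        raise ValueError("need at least one sampled chain")
    # Assumption: whitespace-separated words stand in for tokens.
    lengths = [len(text.split()) for text, _ in chains]
    cutoff = max_ratio * median(lengths)
    kept = [answer for (_, answer), n in zip(chains, lengths) if n <= cutoff]
    # Majority vote over the chains that survived the length filter.
    return Counter(kept).most_common(1)[0][0]

if __name__ == "__main__":
    samples = [
        ("add then halve", "42"),
        ("halve then add the rest", "42"),
        ("guess and check quickly", "7"),
        ("step " * 40, "17"),
        ("step " * 45, "17"),
        ("step " * 50, "17"),
    ]
    print(length_aware_vote(samples))
```

In the sample data, a plain majority vote would favor the answer that comes from the three long chains; the filter removes them before voting. A real pipeline would need a cutoff calibrated per task and per model, since the optimal length shifts with both.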
