When More is Less: Understanding Chain-of-Thought Length in LLMs
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length: performance initially improves but eventually declines as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To understand these dynamics more deeply, we establish a simple theoretical model that formally proves these phenomena, including the optimal length’s scaling laws and the emergence of simplicity bias during RL.
Introduction. Large language models (LLMs) have demonstrated impressive capabilities in solving complex reasoning tasks [3, 36]. A key technique behind their success is Chain-of-Thought (CoT) reasoning [38]. By generating explicit intermediate reasoning steps, CoT allows models to break down complex problems into simpler, more manageable sub-problems, akin to a divide-and-conquer strategy [44]. A common intuition, supported by some research [12, 20], is that longer and more detailed CoT processes generally lead to better performance, especially for difficult tasks. Meanwhile, recent observations also suggest that concise CoTs can sometimes be effective, albeit with potential performance trade-offs. In this paper, through a comprehensive combination of evidence from theoretical analysis, controlled synthetic experiments, and real-world observations, we show that for CoT length, longer is not always better. As illustrated by the trend in Figure 1a, when plotting task accuracy against measures related to the CoT length, performance typically follows an inverted U-shaped curve.
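To make the inverted U-shape concrete, the following is a minimal toy sketch in Python. It is not the theoretical model established in the paper; the functional form and every name in it (`accuracy`, `optimal_length`, `difficulty`, `capability`, `step_noise`) are assumptions chosen for illustration. It assumes that a task of a given difficulty is split evenly across the CoT steps, that a step becomes more error-prone the more work it carries relative to the model's capability, and that every step also carries a small fixed chance of error.

```python
def accuracy(num_steps, difficulty, capability, step_noise=0.02):
    """Toy probability that a CoT with `num_steps` steps is fully correct.

    Assumed form (illustrative only, not taken from the paper):
    - each step carries a load of difficulty / (num_steps * capability);
    - a step fails with probability load**2 (capped at 1) because it is too hard;
    - independently, each step fails with probability `step_noise`;
    - the final answer is correct only if every step succeeds.
    """
    load = difficulty / (num_steps * capability)
    step_success = (1.0 - step_noise) * (1.0 - min(1.0, load) ** 2)
    return step_success ** num_steps


def optimal_length(difficulty, capability, max_steps=200, step_noise=0.02):
    """Number of steps in 1..max_steps that maximizes the toy accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: accuracy(n, difficulty, capability, step_noise),
    )


if __name__ == "__main__":
    # Print the optimum for a few (difficulty, capability) pairs.
    for difficulty, capability in [(16, 4), (32, 4), (16, 8)]:
        n_star = optimal_length(difficulty, capability)
        print(difficulty, capability, n_star,
              round(accuracy(n_star, difficulty, capability), 3))
```

Under these assumptions, too few steps overload each step, while too many steps let the fixed per-step noise accumulate, so accuracy peaks at an intermediate length. In this toy setting the peak moves to longer CoTs as `difficulty` grows and to shorter CoTs as `capability` grows, which mirrors the qualitative trends described above without reproducing the paper's actual derivation.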
Discussion / Conclusion. In this paper, we challenged the notion that longer Chain-of-Thought (CoT) processes are invariably superior, demonstrating through extensive experiments and theoretical analysis that CoT length and accuracy typically follow an inverted U-shaped curve, implying an optimal length that balances task decomposition against error accumulation. We discovered the simplicity bias of CoT, where more capable models prefer shorter effective reasoning paths, and formally derived scaling laws for this optimal length relative to model capability and task difficulty. Practically, we showed that reinforcement learning can guide models towards this optimal CoT length and that training on optimal-length CoTs boosts performance, and we proposed "Length-Filtered Vote" as a promising inference strategy. Our work underscores the critical need to calibrate CoT length, moving beyond a one-size-fits-all approach towards a principled framework where LLMs adaptively choose the right amount of thought to optimize reasoning.
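This section does not spell out how "Length-Filtered Vote" works, so the sketch below is only one plausible reading, written in Python under stated assumptions. It assumes that sampled answers are grouped by CoT length, that groups whose answers agree most (lowest answer entropy) are treated as closest to the optimal length, and that a majority vote is taken over those groups only. The function names, the `bin_width` and `keep_bins` parameters, and the entropy criterion are all hypothetical and may differ from the procedure used in the paper.

```python
import math
from collections import Counter, defaultdict


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    total = len(answers)
    counts = Counter(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def length_filtered_vote(samples, bin_width=2, keep_bins=1):
    """Majority vote restricted to the most self-consistent length groups.

    `samples` is a list of (cot_length, answer) pairs, e.g. from repeated
    sampling of the same question. All design choices here are assumptions:
    lengths are bucketed into bins of `bin_width` steps, bins are ranked by
    answer entropy (lower first, larger bins first on ties), and only the
    top `keep_bins` bins take part in the final vote.
    """
    bins = defaultdict(list)
    for cot_length, answer in samples:
        bins[cot_length // bin_width].append(answer)

    ranked = sorted(bins.values(), key=lambda a: (answer_entropy(a), -len(a)))
    kept = [answer for group in ranked[:keep_bins] for answer in group]
    return Counter(kept).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical samples: short CoTs agree on "A", long CoTs scatter.
    demo = [(2, "A"), (3, "A"), (3, "A"),
            (10, "B"), (11, "C"), (10, "B"), (11, "D"),
            (20, "B"), (21, "E"), (20, "B"), (21, "F")]
    print(length_filtered_vote(demo))
```

The design choice worth noting is that filtering happens before voting: a plain majority vote over all samples can be swayed by many overly long or overly short chains, whereas this variant lets only the most consistent length range decide.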