Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

Paper · arXiv 2501.18585 · Published January 30, 2025
Reasoning Model Architectures · Reasoning Critiques

Abstract. Large language models (LLMs) such as OpenAI’s o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric that quantifies underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning.
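The abstract does not spell out the metric, so the following is only a minimal sketch of one way to measure token efficiency in incorrect answers. It assumes each incorrect response has been annotated with its total token count and with the number of tokens generated up to the end of the first thought that was on a correct path (or `None` when no such thought exists). The function name, the input layout, and the exact formula are illustrative assumptions, not the paper's stated definition.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical record for one INCORRECT response:
#   (total_tokens, tokens_up_to_first_correct_thought or None)
Response = Tuple[int, Optional[int]]


def underthinking_score(incorrect: Sequence[Response]) -> float:
    """Mean fraction of tokens generated after the first correct thought.

    Assumed formula: average over incorrect responses of 1 - useful / total.
    A higher score means more tokens were spent after a promising thought
    had already appeared, i.e. lower token efficiency.
    """
    if not incorrect:
        raise ValueError("need at least one incorrect response")

    wasted_fractions = []
    for total, useful in incorrect:
        if total <= 0:
            raise ValueError("total token count must be positive")
        # Assumption: with no correct thought at all, nothing was abandoned,
        # so the response contributes zero to the score.
        if useful is None:
            useful = total
        if not 0 <= useful <= total:
            raise ValueError("useful tokens must lie in [0, total]")
        wasted_fractions.append(1.0 - useful / total)

    return sum(wasted_fractions) / len(wasted_fractions)


if __name__ == "__main__":
    # First response reached a correct thought after 25 of 100 tokens;
    # the second never contained one.
    sample = [(100, 25), (200, None)]
    print(underthinking_score(sample))  # prints 0.375
```

Under these assumptions, a score near 1 would indicate that incorrect responses typically reached a promising thought early and then moved away from it, while a score near 0 would indicate that little was wasted after the first correct thought.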

Introduction. Large language models (LLMs), such as OpenAI’s o1 (OpenAI, 2024), have revolutionized artificial intelligence by enabling models to tackle increasingly complex tasks. The o1 model and its replicas (Qwen, 2024; DeepSeek, 2025; Kimi, 2025), known for their deep reasoning capabilities, exemplify the potential of LLMs to exhibit human-like deep thinking by scaling test-time computation during problem-solving. These models aim to explore diverse reasoning strategies, reflect on their decisions, and iteratively refine solutions, closely mimicking human cognitive processes. Despite their successes, a critical yet underexplored question remains: Are o1-like LLMs thinking deeply enough? This study offers an initial exploration of that question. We investigate a phenomenon we term underthinking: the tendency of o1-like LLMs to prematurely abandon promising lines of reasoning, which leads to inadequate depth of thought.

Discussion / Conclusion. In this work, we investigated underthinking in o1-like LLMs, identifying it as a significant factor limiting their performance on challenging reasoning tasks. Through comprehensive analysis, we observed that these models frequently abandon promising reasoning paths prematurely, leading to inefficient problem-solving and lower accuracy. We introduced a novel metric that quantifies underthinking by assessing token efficiency in incorrect responses. To mitigate this issue, we proposed a decoding strategy with a thought switching penalty (TIP), which encourages models to thoroughly explore each reasoning thought before considering alternatives. Our empirical results demonstrate that TIP effectively reduces underthinking and enhances accuracy across difficult mathematical and scientific problem sets without requiring additional model training. This work contributes to a deeper understanding of reasoning processes in o1-like LLMs and provides a practical approach to strengthening their problem-solving capabilities.
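As a rough illustration of how a thought switching penalty could act at decoding time, the sketch below lowers the logits of tokens that signal a switch of thought while the current thought is still short. The parameter names `alpha` (penalty strength) and `beta` (number of tokens during which the penalty applies), the set of switch-token ids, and both helper functions are assumptions made for this example; the text above does not specify them.

```python
from typing import List, Sequence, Set


def apply_switch_penalty(
    logits: Sequence[float],
    switch_token_ids: Set[int],
    tokens_in_current_thought: int,
    alpha: float,
    beta: int,
) -> List[float]:
    """Return a copy of `logits` with thought-switch tokens penalized.

    Hypothetical rule: while the current thought is shorter than `beta`
    tokens, subtract `alpha` from the logit of every token id that marks
    the start of a new thought. Other logits are left unchanged.
    """
    if tokens_in_current_thought >= beta:
        # The thought has been explored long enough; switching is unpenalized.
        return list(logits)
    return [
        value - alpha if token_id in switch_token_ids else value
        for token_id, value in enumerate(logits)
    ]


def greedy_pick(logits: Sequence[float]) -> int:
    """Index of the largest logit (greedy decoding step)."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    # Toy vocabulary of three tokens; id 1 stands in for a switch word
    # such as "alternatively" (an assumed example, not taken from the text).
    logits = [1.0, 3.0, 2.0]
    switch_ids = {1}

    early = apply_switch_penalty(logits, switch_ids, 0, alpha=3.0, beta=10)
    late = apply_switch_penalty(logits, switch_ids, 10, alpha=3.0, beta=10)

    print(greedy_pick(early))  # prints 2: the switch token is suppressed
    print(greedy_pick(late))   # prints 1: the penalty window has passed
```

Because a rule of this kind only adjusts logits during generation, it is compatible with the claim that no additional model training is needed; how switch tokens are identified and how `alpha` and `beta` are chosen would have to come from the full paper.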