Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

Paper · arXiv 2503.19855 · Published March 25, 2025

Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and in reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach: Multi-round Thinking. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-Diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques.

Introduction. Test-time (inference) compute (Yang et al., 2025; Wu et al., 2025) refers to the computational resources a large language model (LLM) uses while generating responses to prompts, as distinct from the training compute used to create and refine the model. Leveraging step-by-step reasoning, which explicitly provides models with intermediate reasoning steps (Lightman et al., 2023; Wei et al., 2023), has substantially improved accuracy on complex tasks. In recent years, the performance improvements of language models have largely depended on massive-scale self-supervised pre-training (Kaplan et al., 2020; Hoffmann et al., 2022), that is, on scaling up training-time compute. However, as advances in training-time scaling slow, attention is increasingly turning towards scaling up test-time compute (Muennighoff et al., 2025; Chen et al., 2025). OpenAI (OpenAI, 2024a) pioneered this approach with its o1 series models (OpenAI, 2024b), which use large-scale reinforcement learning (RL).

Discussion / Conclusion. In this study, we proposed Multi-round Thinking, a straightforward yet effective test-time scaling strategy designed to enhance the reasoning capabilities of large language models (LLMs). Inspired by human cognitive processes, this iterative approach allows models to refine their reasoning by independently reconsidering their previous answers, significantly mitigating cognitive inertia and correcting initial reasoning errors. Our extensive experiments demonstrated consistent and substantial improvements across challenging benchmarks, including AIME 2024, GPQA-Diamond, MATH-500, and LiveCodeBench. For instance, accuracy improved by more than 2 percentage points on complex mathematical competition tasks, underscoring the broad applicability and practical value of this approach. Further analysis revealed that multi-round reasoning not only improved accuracy but also made the models’ reasoning more concise and confident.
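The iterative procedure summarized above, in which each round re-asks the question with the previous round's answer included in the prompt, might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `multi_round_thinking`, the `generate` callable (a stand-in for any LLM call that maps a prompt to a final answer), and the wording of `REASK_TEMPLATE` are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical re-ask template (an assumption for this sketch); the paper's
# exact prompt wording is not reproduced here.
REASK_TEMPLATE = (
    "{question}\n"
    "The assistant's previous answer is: <answer>{previous}</answer>. "
    "Please re-answer."
)


def multi_round_thinking(
    question: str,
    generate: Callable[[str], str],
    rounds: int = 2,
) -> List[str]:
    """Run `rounds` rounds of answering and return the answer from each round.

    `generate` is a placeholder for a model call. Only the previous round's
    final answer is fed forward, so each round reasons afresh rather than
    continuing the earlier reasoning trace.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")

    # Round 1: answer the original question directly.
    answers = [generate(question)]

    # Rounds 2..N: re-ask, showing only the most recent answer.
    for _ in range(rounds - 1):
        prompt = REASK_TEMPLATE.format(question=question, previous=answers[-1])
        answers.append(generate(prompt))

    return answers
```

With `rounds=2`, the last element of the returned list corresponds to a "Round 2" answer in the sense used above. Returning every round's answer, rather than only the last, makes it straightforward to compare accuracy round by round.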
