LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR2Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR2Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. Our extensive evaluation of both conventional LLMs and LRMs reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR2Bench, achieving average Exact Match scores of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs.
Introduction. Recent advancements in Large Reasoning Models (LRMs), exemplified by QwQ-32B, DeepSeek-R1, and OpenAI-o1 (Qwen, 2024; Guo et al., 2025; OpenAI, 2024a), have demonstrated substantial progress in the reasoning capabilities of Large Language Models (LLMs). These models exhibit more human-like behaviors, such as making assumptions,
Discussion / Conclusion. This paper introduces LR2Bench, a novel benchmark to comprehensively evaluate the reflection capabilities of LLMs in long-chain reasoning. LR2Bench comprises six tasks with varying difficulty levels, providing a thorough analysis across diverse scenarios. The experimental results show that LRMs outperform conventional LLMs on reflective reasoning. Our findings also highlight the limitations of current reasoning LLMs: even the most advanced reasoning models fall short of satisfactory performance, suggesting significant room for enhancement in reflective reasoning capabilities.