Can reasoning models actually sustain long-chain reflection?

Tests whether large reasoning models genuinely perform self-correction and backtracking, or merely simulate it fluently. Uses constraint satisfaction problems where performance cannot be faked by surface plausibility.

Synthesis note · 2026-05-02 · sourced from Reasoning Methods CoT ToT

LR²Bench takes the central marketing claim of Large Reasoning Models (LRMs), that they can sustain long-chain reflective reasoning by making assumptions, backtracking, and self-refining over many steps, and tests it where the claim cannot be faked by surface fluency. The benchmark consists of 850 Constraint Satisfaction Problems (CSPs) across six task families spanning knowledge-based, logical, and spatial categories. DeepSeek-R1 averages 20.0% Exact Match; OpenAI o1-preview averages 23.6%. These are frontier LRMs, evaluated on tasks designed to require exactly the capability they were trained for.

CSPs are the right test because they are unforgiving in a specific way. A candidate solution either satisfies every constraint or it does not; there is no partial-credit reading where the trace merely looks plausible. Reflection in CSPs requires real backtracking: when a partial assignment violates a constraint, the solver must abandon that branch and try another. Surface-level "wait, let me reconsider" does not satisfy a constraint that was just violated. The 20.0-23.6% Exact Match ceiling means that on roughly 76-80% of these problems, reflective fluency fails to convert into reflective competence.
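To make the distinction between real backtracking and all-or-nothing scoring concrete, here is a minimal Python sketch. It is not taken from LR²Bench: the three-variable coloring puzzle, the function names (`violates`, `exact_match`, `solve`), and the `abandoned` log are all hypothetical illustrations, and the benchmark's own tasks and Exact Match scoring may be defined differently.

```python
from typing import Dict, List, Optional, Tuple

Assignment = Dict[str, str]
# Hypothetical constraint form: the two named variables must take different values.
Constraint = Tuple[str, str]


def violates(assignment: Assignment, constraints: List[Constraint]) -> bool:
    """True if any constraint whose variables are both assigned is broken."""
    return any(
        a in assignment and b in assignment and assignment[a] == assignment[b]
        for a, b in constraints
    )


def exact_match(assignment: Assignment, variables: List[str],
                constraints: List[Constraint]) -> bool:
    """All-or-nothing check: every variable assigned, no constraint broken."""
    return all(v in assignment for v in variables) and not violates(assignment, constraints)


def solve(variables: List[str], domains: Dict[str, List[str]],
          constraints: List[Constraint],
          assignment: Optional[Assignment] = None,
          abandoned: Optional[List[Tuple[str, str]]] = None) -> Optional[Assignment]:
    """Depth-first search that undoes a choice as soon as it cannot work."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if not violates(assignment, constraints):
            result = solve(variables, domains, constraints, assignment, abandoned)
            if result is not None:
                return result
        # Real backtracking: the choice is removed, not just commented on.
        del assignment[var]
        if abandoned is not None:
            abandoned.append((var, value))
    return None


# Toy instance (an assumption for illustration): color a triangle A-B-C.
VARIABLES = ["A", "B", "C"]
DOMAINS = {v: ["red", "green", "blue"] for v in VARIABLES}
CONSTRAINTS = [("A", "B"), ("B", "C"), ("A", "C")]

abandoned: List[Tuple[str, str]] = []
solution = solve(VARIABLES, DOMAINS, CONSTRAINTS, abandoned=abandoned)

# A complete, plausible-looking answer that still breaks one constraint (A and C).
plausible_but_wrong = {"A": "red", "B": "green", "C": "red"}

if __name__ == "__main__":
    print("solution:", solution)
    print("abandoned branches:", abandoned)
    print("exact match (solution):", exact_match(solution, VARIABLES, CONSTRAINTS))
    print("exact match (plausible):", exact_match(plausible_but_wrong, VARIABLES, CONSTRAINTS))
```

In this sketch the solver abandons three choices before finding a valid coloring, while the plausible-looking answer that satisfies two of the three constraints scores the same as no answer at all. That is the property that keeps fluent reconsideration from earning credit.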

This converges with "Does the reasoning cliff depend on how we test models?": text-only LRM evaluation reveals the cliff that tool-augmented evaluation often hides. It also converges with "Do language models fail at reasoning due to complexity or novelty?": frontier LRMs are not failing on long chains in general; they are failing on chains whose instance structure was not in training. CSPs are precisely such structure, because each instance is a fresh combinatorial space.

The methodological provocation is that CSPs are exactly where "Can symbolic solvers fix how LLMs reason about logic?" would predict tool-enabled rescue. The 20.0-23.6% range is the unaided ceiling. Whether tool access closes the gap is the next question; without tools, the gap is large enough to call long-chain reflection "theatrical" in the technical sense: fluent, well-formed, and not actually doing the work.
