INQUIRING LINE

Why do reasoning models struggle with self-evaluation and revision?

This explores why reasoning models are bad at checking and fixing their own work — and the corpus points to a single culprit: a model judging itself isn't a neutral judge.


The corpus converges on a striking answer: the problem isn't that revision is hard, it's that a model grading its own homework is structurally biased toward agreeing with itself. Models systematically over-trust the answers they generated, because a high-probability output simply *feels* more correct during evaluation, so self-assessment becomes a self-agreement loop rather than a real check (Why do models trust their own generated answers?). When a model then revises based on its own uncertain reasoning, it tends to amplify confidence in its errors rather than correct them (Does a model improve by arguing with itself?).

The sharpest finding is that the *source* of the critique, not the act of revising, decides the outcome. Revision guided by an external critic improves accuracy; revision guided by the model's own internal judgment degrades it (Does revising your own reasoning actually help or hurt?). This shows up directly in o1-style models: across QwQ, R1, and LIMO, most revisions keep the wrong answer, smaller models flip correct answers to incorrect, and longer chains with more revision steps correlate with *lower* accuracy (Does self-revision actually improve reasoning in language models?). The escape hatch is diversity: multi-agent debate with genuinely different models breaks the self-agreement loop and improves both accuracy and calibration, precisely because the critic is no longer the same mind that produced the answer (Does a model improve by arguing with itself?).
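The asymmetry described above can be made concrete with a toy loop in which the critic is a swappable function. This is only a sketch under stated assumptions: `generate`, `self_critic`, `external_critic`, and `revise` are hypothetical names, the "model" is a hard-coded stub, and the external critic is assumed to have access to a check the generator lacks (here, actual arithmetic).

```python
from typing import Callable, Optional

# A critic inspects (question, answer) and returns a suggested replacement,
# or None when it has no objection. All names here are hypothetical.
Critic = Callable[[str, int], Optional[int]]


def generate(question: str) -> int:
    """Stub generator: stands in for a model that made an arithmetic slip."""
    return 56  # wrong on purpose; 7 * 9 is 63


def self_critic(question: str, answer: int) -> Optional[int]:
    """Same 'mind' as the generator: it re-derives the same answer, so it agrees."""
    rederived = generate(question)
    return None if rederived == answer else rederived


def external_critic(question: str, answer: int) -> Optional[int]:
    """Independent check (assumption: it can actually evaluate the product)."""
    left, right = (int(term) for term in question.split("*"))
    truth = left * right
    return None if answer == truth else truth


def revise(question: str, critic: Critic, max_rounds: int = 3) -> int:
    """Revise until the critic stops objecting or the round budget runs out."""
    answer = generate(question)
    for _ in range(max_rounds):
        suggestion = critic(question, answer)
        if suggestion is None:
            break
        answer = suggestion
    return answer


print(revise("7 * 9", self_critic))      # 56
print(revise("7 * 9", external_critic))  # 63
```

The loop is identical in both calls; only the critic differs. The stub overstates the effect, since a real model judging itself does sometimes disagree with its first answer, but it shows why adding revision rounds cannot help when the critic shares the generator's blind spot.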

This reframes a lot of what looks like "reflection." Studies across eight models find that reasoning traces are mostly confirmatory theater: reflections rarely change the initial answer and largely serve as post-hoc justification (Can we actually trust reasoning model outputs?; Is reflection in reasoning models actually fixing mistakes?). Training on longer reflection chains improves the *first answer's* quality but not the model's ability to self-correct. So the visible "thinking out loud" is mostly not the model catching itself; it's the model rationalizing where it already landed. Worse, calibration degrades under binary reward training and the monitoring signals are easily gamed.

There's a second layer worth knowing: even when self-evaluation works, the underlying reasoning may be too disorganized to revise toward anything better. Reasoning models "wander": they explore invalid paths and abandon promising ones prematurely (underthinking), so success probability collapses exponentially as problems get deeper (Why do reasoning models abandon promising solution paths?; Why do reasoning LLMs fail at deeper problem solving?). DeepSeek-R1 and o1-preview reach only 20-23.6% exact match on 850 constraint-satisfaction problems that demand genuine backtracking, revealing that fluent-sounding reflection doesn't translate into the actual competence revision requires (Can reasoning models actually sustain long-chain reflection?). And some of these breakdowns aren't even reasoning failures. They're execution failures, where a model knows the algorithm but can't carry out the multi-step procedure in text alone (Are reasoning model collapses really failures of reasoning?).
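A back-of-the-envelope model shows why depth is so punishing. Assume, purely for illustration, that each reasoning step succeeds independently with probability `p_step` (an assumption, not a figure from the cited notes); a problem that needs `depth` correct steps in a row then succeeds with probability `p_step ** depth`. The function name and the 0.9 value below are hypothetical.

```python
def success_probability(p_step: float, depth: int) -> float:
    """Chance of getting `depth` steps right in a row, assuming independence."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p_step ** depth


# Hypothetical per-step reliability of 0.9, chosen only for illustration.
for depth in (1, 5, 10, 20, 50):
    print(f"depth={depth:>2}  P(success)={success_probability(0.9, depth):.4f}")
```

With these made-up numbers, a model that is 90% reliable per step is right about a third of the time at depth 10 and less than 1% of the time at depth 50. Real reasoning steps are not independent, so this is a rough intuition for the shape of the curve rather than a prediction.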

The thing you didn't know you wanted to know: the cure for bad self-evaluation isn't "make the model reflect harder." More reflection often makes it more confidently wrong. The cure is *otherness*: an external critic, a differently trained debate partner, or comparing the answer against a broad set of alternatives instead of against itself (Why do models trust their own generated answers?; Does revising your own reasoning actually help or hurt?). Self-correction, it turns out, may be the one thing a single model can't reliably do alone.
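One way to read "comparing the answer against a broad set of alternatives" is to score a candidate relative to its rivals rather than in isolation. The sketch below is an illustration under stated assumptions, not the method of any cited paper: `relative_confidence` is a hypothetical helper, and the scores are made-up stand-ins for whatever plausibility a judge assigns to each candidate answer.

```python
from typing import Mapping


def relative_confidence(scores: Mapping[str, float], candidate: str) -> float:
    """Share of total plausibility held by `candidate` among all alternatives.

    `scores` maps each candidate answer to a non-negative plausibility score
    from some judge (hypothetical; how the judge is built is out of scope).
    """
    if candidate not in scores:
        raise KeyError(candidate)
    if any(score < 0 for score in scores.values()):
        raise ValueError("scores must be non-negative")
    total = sum(scores.values())
    if total == 0:
        raise ValueError("at least one score must be positive")
    return scores[candidate] / total


# Made-up scores: judged alone, the model's own answer has no competition.
alone = {"own answer": 0.75}
# Judged next to rivals that also look plausible, its share shrinks.
with_rivals = {"own answer": 0.75, "rival A": 0.5, "rival B": 0.25}

print(relative_confidence(alone, "own answer"))        # 1.0
print(relative_confidence(with_rivals, "own answer"))  # 0.5
```

The design point is that an answer evaluated only against itself always wins by default; forcing it to share a fixed budget of confidence with alternatives gives the judge something to disagree with.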


Sources (10 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-model researcher re-testing claims about self-evaluation and revision. The question remains: why do reasoning models struggle to grade and correct their own work?

What a curated library found — and when (dated claims, not current truth):
Findings span Sept 2023–Feb 2026. The library reports:
  • Models systematically over-trust their own outputs during self-assessment; self-revision guided by internal judgment degrades accuracy, while external critique improves it (2024–2025).
  • Across o1-style models (QwQ, R1, LIMO), most revisions keep wrong answers; longer reasoning chains correlate with *lower* accuracy (~2025).
  • Reasoning traces are largely post-hoc justification: reflections rarely change the initial answer, and training on longer chains improves first-answer quality but not self-correction ability (2024–2025).
  • Reasoning models "wander" (exploring invalid paths and abandoning promising ones prematurely), and DeepSeek-R1 and o1-preview reach only 20–23.6% exact match on constraint-satisfaction problems requiring genuine backtracking (~2025).
  • Multi-agent debate with genuinely different models breaks the self-agreement loop and improves both accuracy and calibration (2023–2024).

Anchor papers (verify; mind their dates):
  • arXiv:2309.13007 (ReConcile, 2023) — consensus via diverse LLMs
  • arXiv:2403.09972 (Self-Detection, 2024) — comprehensive testing of LLM self-evaluation bias
  • arXiv:2505.20296 (Wandering Solution Explorers, 2025) — premature path abandonment in reasoning
  • arXiv:2509.07339 (Performative Thinking, 2025) — CoT length / problem complexity correlation

Your task:
(1) RE-TEST EACH CONSTRAINT. For the self-agreement-loop claim, check whether newer model scaling, training objectives (e.g., process-reward models, outcome-supervised reasoning at scale), or architectural changes (e.g., explicit uncertainty quantification, hierarchical planning modules) have since relaxed the bias or enabled genuine internal critique. Test whether the "external critic always wins" claim still holds under recent multi-round training or hybrid internal–external setups. Judge whether the wandering problem persists in frontier models or is offset by improved search heuristics or planning.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming self-refinement *can* work under specific conditions (e.g., chain-of-thought + uncertainty masking, or structured decomposition), or work showing that the library's findings are artifacts of weak baselines or task mismatch.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can a model learn to *distinguish* when to trust vs. distrust its own judgment, and does explicit calibration training enable selective self-correction? (b) Do process rewards or iterative preference learning decouple the self-agreement loop by rewarding *critique quality* rather than agreement?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines