INQUIRING LINE

How does self-revision on wrong answers increase model confidence further?

This explores why a model that re-examines its own wrong answer tends to dig in harder rather than fix the mistake — and what the corpus says is actually happening when a model 'reconsiders.'


This explores the strange dynamic where asking a model to revise a wrong answer often makes it *more* sure of being wrong, not less. The corpus traces this back to a structural bias: models systematically over-trust answers they generated themselves, because a high-probability output simply *feels* correct when the same model evaluates it Why do models trust their own generated answers?. Self-revision feeds that loop. When a model reconsiders based only on its own prior reasoning, it isn't sampling a fresh perspective — it's re-encountering the same confident error and ratifying it, a failure mode sharp enough to have its own name: degeneration of thought Does a model improve by arguing with itself?.

The clearest finding in the collection is that the *source* of the critique, not the act of revising, decides the outcome. Revision guided by an external model improves accuracy; a model revising its own uncertain output usually amplifies confidence in the wrong answer instead of correcting it Does revising your own reasoning actually help or hurt?. Several papers converge on this from different angles: across QwQ, R1, and LIMO, most revisions *retain* the wrong answer, and smaller models even flip correct answers to incorrect — with longer revision chains correlating with lower accuracy, not higher Does self-revision actually improve reasoning in language models?. An analysis of eight reasoning models goes further, suggesting much of what looks like 'reflection' is theater: reflections rarely change the answer and mostly serve as post-hoc confirmation of the first guess Is reflection in reasoning models actually fixing mistakes?.

There's a mechanical reason the spiral worsens over a long chain. Once an error sits in the context window, it biases everything that follows — models degrade non-linearly when their own prior mistakes contaminate their working history, and a wrong step in context actively conditions the next wrong step Do models fail worse when their own errors fill the context?Can model confidence work as a reward signal for reasoning?. So self-revision doesn't just fail to correct; it can manufacture new evidence (the model's own confident-but-wrong reasoning) that the model then treats as support. The thing you'd hope breaks the loop — looking again — is exactly what tightens it.

What actually reverses it is introducing genuine difference. Multi-agent debate between *different* models breaks the self-agreement loop and improves both accuracy and calibration, because the disagreement supplies the outside view a single model can't generate for itself Does a model improve by arguing with itself?. And training-time approaches that work tend to make the model practice on its *actual* mistakes under reinforcement learning, rather than rehearse idealized correction traces — offline self-correction data fails precisely because the model's training-time errors don't match the errors it makes at test time Why does self-correction training on offline data fail?.

The quietly unsettling takeaway: confidence and correctness come apart here in a way that's easy to miss. A confident model is more robust to prompt rephrasing Does model confidence predict robustness to prompt changes?, and confidence can even be harnessed as a useful reward signal that *restores* calibration when wired in deliberately Can model confidence work as a reward signal for reasoning?Can model confidence alone replace external answer verification? — but left to revise itself unsupervised, a model's growing confidence is just as likely to be a measure of how thoroughly it has talked itself into the wrong answer.


Sources 10 notes

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing claims about self-revision failure in LLMs. The question: does self-revision on wrong answers systematically increase model confidence further, and if so, why—and has that changed?

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot:
• Models over-trust their own prior outputs; self-revision guided by external critique improves accuracy, but single-model self-revision amplifies confidence in wrong answers instead (~2024–2025).
• Across QwQ, R1, LIMO: most self-revisions retain the wrong answer; longer revision chains correlate with lower accuracy (~2025).
• "Reflection" in reasoning models is mostly theater—reflections rarely change answers and serve post-hoc confirmation (~2024).
• Self-conditioning: once an error enters context, it biases downstream reasoning non-linearly; models degrade when their own prior mistakes contaminate working history (~2025).
• Multi-agent debate between different models breaks the self-agreement loop and improves both accuracy and calibration (~2023–2025).

Anchor papers (verify; mind their dates):
• arXiv:2404.09129 (Apr 2024): "When Hindsight is Not 20/20" — reflective thinking limits.
• arXiv:2409.12917 (Sep 2024): "Training Language Models to Self-Correct via RL" — training-time fixes.
• arXiv:2502.12215 (Feb 2025): "Revisiting Test-Time Scaling of o1-like Models" — scaling dynamics.
• arXiv:2509.09677 (Sep 2025): "The Illusion of Diminishing Returns" — long-horizon execution failure.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, assess whether newer models (o1-pro, Claude 4, etc.), training methods (RL-from-self-feedback post-completion learning, verifier-free RLVR), tooling (multi-turn harnesses, thought caching), or orchestration (multi-agent + memory) have RELAXED or OVERTURNED it. Flag where the constraint still holds and where it may have dissolved. Separate durable question ("Why does self-revision fail?") from perishable limitation ("All self-revision fails").
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has any paper since Sep 2025 shown self-revision *does* work under specific conditions, or that confidence-calibration can be fixed at scale?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "Can verifier-free RL induce reliable self-correction without external critique?", "Does thought caching + multi-turn self-debate break the confidence spiral?"

Cite arXiv IDs; flag anything you cannot ground in a real paper. 👇

Next inquiring lines