INQUIRING LINE


When an AI reconsiders its own answer, it usually makes things worse — but critique from a different model actually helps.

Does internal self-revision actually degrade reasoning accuracy in models?

This explores whether a model reworking its own reasoning (self-revision, reflection, second-guessing) makes its answers worse — and the corpus says the answer turns on *who* is doing the revising and *how the model was trained*, not on the act of revision itself.

The short version from the corpus: revising is not the problem; revising *yourself* is. The cleanest statement comes from work showing that the source of the revision determines the outcome: when an external model critiques the reasoning, accuracy improves, but when a model second-guesses its own uncertain output, it usually amplifies confidence in the wrong answer instead of fixing it (see "Does revising your own reasoning actually help or hurt?"). Direct evidence from o1-style reasoning models backs this up: across QwQ, R1, and LIMO, most revisions keep the wrong answer, smaller models often flip *correct* answers to incorrect mid-revision, and longer chains with more revisions correlate with lower accuracy (see "Does self-revision actually improve reasoning in language models?").
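One way to make the "most revisions keep the wrong answer" claim concrete is to tally what each revision did to correctness. The sketch below is only an illustration of that bookkeeping, not the method used in the cited work: the function name, the record layout, and the toy data are all assumptions made up for this example.

```python
from collections import Counter

def tally_revisions(records):
    """Count revision outcomes.

    records: iterable of (answer_before, answer_after, gold) tuples.
    This layout is hypothetical, chosen only for the illustration.
    """
    counts = Counter()
    for before, after, gold in records:
        was_right = before == gold
        is_right = after == gold
        if was_right and is_right:
            counts["correct_kept"] += 1
        elif was_right:
            counts["correct_to_wrong"] += 1   # revision broke a right answer
        elif is_right:
            counts["wrong_to_correct"] += 1   # revision fixed a wrong answer
        else:
            counts["stayed_wrong"] += 1       # includes wrong -> different wrong

    return counts

# Made-up toy data, not results from any paper.
records = [
    ("4", "4", "4"),
    ("5", "4", "4"),
    ("4", "7", "4"),
    ("9", "9", "4"),
    ("9", "8", "4"),
]
counts = tally_revisions(records)
# Net benefit of revising: fixes minus breakages.
net_gain = counts["wrong_to_correct"] - counts["correct_to_wrong"]
```

A net gain at or below zero, with a large "stayed_wrong" bucket, is the pattern the paragraph above describes.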

Why does this happen? Two mechanisms show up repeatedly. First, models have a built-in bias toward trusting things they themselves generated: a high-probability self-generated answer simply *feels* more correct when the same model evaluates it, so self-checking collapses into self-agreement (see "Why do models trust their own generated answers?"). Second, when a single model keeps arguing with its own prior reasoning, it slides into a failure mode where it grows *more* certain of errors rather than less. The fix is diversity: debate between genuinely different models reverses the pattern and improves both accuracy and calibration (see "Does a model improve by arguing with itself?"). The common thread is that a model has no independent vantage point on itself; the corrective signal has to come from outside the loop.

There's an even more deflating finding worth sitting with: a lot of what looks like self-correction isn't correction at all. Analysis across eight reasoning models found that reflection rarely changes the final answer and mostly serves as post-hoc confirmation of the first answer; training on longer reflection chains improves the *first answer's* quality, not the model's ability to fix itself (see "Is reflection in reasoning models actually fixing mistakes?"). In the same spirit, DeepSeek-R1 and o1-preview, which sound fluent while reflecting, reach only 20–23.6% exact match on 850 constraint-satisfaction problems that demand real backtracking, showing that reflective *fluency* doesn't translate into reflective *competence* (see "Can reasoning models actually sustain long-chain reflection?").

But here's the turn that makes this more than a 'self-revision is bad' story: the behavior is trainable. Vanilla models use extended thinking counterproductively, since it induces self-doubt that degrades performance, yet RL training redirects that exact same mechanism into productive gap analysis. Training, not the act of thinking, mediates quality (see "Does extended thinking help or hurt model reasoning?"). Other approaches close the loop from inside in disciplined ways: using the model's own answer-span confidence as a reward signal strengthens step-by-step reasoning while *restoring* calibration (see "Can model confidence work as a reward signal for reasoning?"), and post-completion learning trains genuine self-evaluation into the model rather than letting it improvise self-critique at inference (see "Can models learn to evaluate their own work during training?").
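To make the confidence-as-reward idea less abstract, here is a minimal sketch of ranking reasoning traces by answer-span confidence and turning the ranking into preference pairs. It rests on assumptions that are not in the source: confidence is taken to be the geometric-mean token probability over the answer span, and the `Trace` class, `min_gap` threshold, and toy log-probabilities are hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

@dataclass
class Trace:
    text: str                     # full reasoning trace
    answer_logprobs: List[float]  # log-probs of the answer-span tokens only

def answer_confidence(trace: Trace) -> float:
    # Assumed definition: geometric mean of answer-token probabilities.
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))

def preference_pairs(traces: List[Trace], min_gap: float = 0.1) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) text pairs, most confident first.

    Pairs whose confidence gap is below min_gap are skipped so that
    near-ties do not become training signal (an illustrative choice).
    """
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    for better, worse in combinations(ranked, 2):
        if answer_confidence(better) - answer_confidence(worse) >= min_gap:
            pairs.append((better.text, worse.text))
    return pairs

# Made-up log-probabilities for three sampled traces of one prompt.
traces = [
    Trace("trace A", [-0.05, -0.05]),
    Trace("trace B", [-0.7, -0.7]),
    Trace("trace C", [-0.1, -0.1]),
]
pairs = preference_pairs(traces)
```

The resulting pairs could feed any preference-based training objective; the point of the sketch is only that no human label or external verifier enters the ranking.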

The thing you might not have expected to learn: even setting aside *who* revises, more revision is its own hazard because it usually means more thinking, and thinking has an optimum. Accuracy follows an inverted-U with chain length: one model dropped from 87% to 70% as thinking tokens climbed from ~1,100 to ~16K (see "Does more thinking time always improve reasoning accuracy?" and "Why does chain of thought accuracy eventually decline with length?"). Much of that waste is models abandoning good reasoning paths too early, which a simple penalty on thought-switching can fix without retraining (see "Do reasoning models switch between ideas too frequently?"). So 'does self-revision degrade accuracy?' resolves into something sharper: unguided self-revision tends to degrade it, longer self-revision degrades it past a point, but externally-guided or training-instilled revision is exactly where the gains live.
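The thought-switching penalty mentioned above acts purely at decoding time, which a few lines can illustrate. This is a sketch under stated assumptions rather than the published procedure: the penalty is modeled as subtracting a constant `alpha` from the logits of switch tokens while the current thought is younger than `beta` tokens, and the function names, parameter values, and toy vocabulary are all hypothetical.

```python
def penalize_switch_tokens(logits, switch_tokens, tokens_in_current_thought,
                           alpha=3.0, beta=600):
    """Return a copy of logits with switch tokens penalized early in a thought.

    logits: dict mapping token -> raw score for the next position.
    switch_tokens: set of tokens that tend to start a new line of attack.
    alpha, beta: illustrative penalty strength and duration, not tuned values.
    """
    if tokens_in_current_thought >= beta:
        return dict(logits)  # thought is long enough; allow switching freely
    return {
        tok: (score - alpha if tok in switch_tokens else score)
        for tok, score in logits.items()
    }

def greedy_pick(logits):
    # Choose the highest-scoring token.
    return max(logits, key=logits.get)

# Toy next-token scores; "alternatively" would win without the penalty.
logits = {"so": 2.0, "alternatively": 2.5, "therefore": 1.0}
switch_tokens = {"alternatively"}

early_choice = greedy_pick(penalize_switch_tokens(logits, switch_tokens, 10))
late_choice = greedy_pick(penalize_switch_tokens(logits, switch_tokens, 700))
```

Early in a thought the model is nudged to keep going ("so"); once the thought has run long enough, switching is no longer discouraged. Nothing in the model's weights changes.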

Sources 12 notes

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.


Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning researcher evaluating whether internal self-revision degrades LLM accuracy—a question a curated library (2023–2025) has mapped across multiple angles. Treat the findings below as dated claims needing re-test, not current truth.

**What a curated library found — and when:**
- Self-revision by the model itself amplifies confidence in wrong answers; external critique improves accuracy (2024).
- Models exhibit inherent bias toward trusting their own prior outputs; debate between *different* models fixes this (2024).
- Reflection rarely changes final answers; training on longer chains improves *first-answer* quality, not correction capacity (2024).
- DeepSeek-R1 and o1-preview achieve only 20–23.6% exact match on constraint-satisfaction tasks demanding real backtracking, showing fluent reflection ≠ reflective competence (2024).
- Reasoning accuracy follows an inverted-U with thinking tokens; in one reported case accuracy fell from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16K; underthinking (premature thought-switching) is a distinct failure mode (2025).
- RL training can redirect extended thinking from self-doubt into productive gap analysis; confidence-based self-feedback (RLSF) restores calibration; post-completion learning trains self-evaluation into the model (2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2403.09972 (2024): Self-detection bias in LLMs
- arXiv:2404.09129 (2024): Limits of reflective thinking
- arXiv:2501.18585 (2025): Underthinking in o1-like models
- arXiv:2507.21931 (2025): RL from self-feedback

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For each finding—especially the inverted-U threshold and the claim that unguided self-revision degrades accuracy—judge whether newer models (post-July 2025), training methods (new RL curricula, DPO variants), or inference harnesses (memory, routing, multi-critic systems) have relaxed or overturned it. Separate durable (e.g., *unguided* self-revision faces an inherent self-agreement problem) from perishable (e.g., the 16K-token threshold, specific accuracy drop rates).

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months—especially any showing that calibrated self-revision, model ensembles, or novel RL signals have restored self-correction's value. Flag disagreement on whether reflection is "mostly theater" vs. genuinely foundational to reasoning.

(3) **Propose 2 research questions** that assume the regime may have shifted:
   - Does externally-anchored confidence (e.g., peer-model agreement signals fed back at inference) allow a model to overcome its own self-agreement bias?
   - Can post-completion learning or RL make a single model's self-revision as effective as inter-model debate *without* ensemble overhead?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
