INQUIRING LINE


An AI revising its own work usually amplifies its mistakes — but critique from a separate model genuinely improves accuracy.

Why does external critique improve revision while internal self-assessment fails?

This explores why a model that gets feedback from an outside source revises well, while a model grading its own work tends to make things worse — and what the corpus says is actually doing the work.

The corpus's sharpest answer is that the *act* of revising isn't what helps or hurts; the *source* of the critique is. One study makes this almost surgically clear: revision guided by an external model raises accuracy, but a model revising its own uncertain output usually just amplifies confidence in the wrong answer rather than fixing it (see "Does revising your own reasoning actually help or hurt?"). Self-revision in strong reasoning models (QwQ, R1, LIMO) mostly preserves wrong answers, and smaller models frequently flip correct answers to incorrect; longer chains with more revisions actually correlate with *lower* accuracy (see "Does self-revision actually improve reasoning in language models?").

The mechanism behind the failure has a name: degeneration of thought. When a model reconsiders an answer using its own prior reasoning, it has no independent vantage point: it is checking its work against the same flawed prior that produced the error, so it converges toward false confidence instead of away from it (see "Does a model improve by arguing with itself?"). The fix in that same work is telling: replace the single self with *genuinely different* models in debate, and the pattern reverses, with both accuracy and calibration improving. Difference, not introspection, is the active ingredient.
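A toy calculation can make the "difference, not introspection" point concrete. The sketch below is not the cited study's method; it is a textbook majority-vote model under loud assumptions: answers are binary, every checker is correct with the same probability `p`, and errors are either fully independent (different models) or perfectly correlated (one model re-checking itself). The function names are hypothetical.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n voters is correct.

    Assumption: each voter is independently correct with probability p,
    and n is odd so a strict majority always exists.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1)
    )


def self_check_accuracy(p: float, n: int) -> float:
    """Accuracy when the same model re-checks itself n times.

    Assumption: errors are perfectly correlated, so every re-check
    repeats the first verdict and the extra passes add nothing.
    """
    return p
```

Under these assumptions, three independent checkers at `p = 0.7` reach a majority accuracy of 0.784, while three self-checks stay at 0.7. The same formula also shows that independence alone is not enough: with `p = 0.4` the majority is right less often than a single checker.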

This is why "pure" self-improvement keeps hitting a wall. One synthesis argues that methods which look self-contained almost always smuggle in an external anchor (a past model version, a third-party judge, a user correction, a tool's output) because unaided self-improvement stalls on the generation–verification gap, diversity collapse, and reward hacking (see "Can models reliably improve themselves without external feedback?"). The deep problem is that a model's ability to *verify* an answer isn't reliably better than its ability to *generate* one, so it has no leverage to correct itself from the inside.

What makes external critique different isn't just that it catches errors at test time. Critique signal injected during training counteracts "tail narrowing": it keeps the model's solution space diverse instead of prematurely collapsing onto its favorite answer (see "Do critique models improve diversity during training itself?"). And training a model to *critique* noisy responses produces deeper understanding than training it to imitate correct ones, because critique forces engagement with how things fail rather than copying surface patterns (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). That connects to a quieter finding worth knowing: imitation training captures a confident, fluent *style* without closing any real capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?"), and self-assessment that flatters its own style is the same trap viewed from inside.

The interesting wrinkle is that internal self-assessment isn't doomed in principle; it's doomed when it has nothing external to ground it. Approaches that get self-judging to work do so by manufacturing an outside-like signal. SERL has the model alternate between generating and *ranking* responses, deriving reward from the consistency between independent judgments rather than from a single self-endorsement (see "Can models learn to judge themselves without external rewards?"). Post-Completion Learning trains self-evaluation in unused sequence space so the model internalizes an evaluation function rather than rubber-stamping its first output (see "Can models learn to evaluate their own work during training?"). The throughline across all of it: revision works when something independent (a different model, a ranking, a held-out judge) breaks the loop of a mind checking itself against itself.
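The idea of deriving reward from consistency between judgments, rather than from one self-endorsement, can be sketched as follows. This is an illustrative assumption, not SERL's actual reward formula: the `judge` callable, the order-swap consistency check, and the win-rate normalization are all hypothetical choices made for the example.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# Hypothetical judge signature: returns True if the FIRST response is preferred.
Judge = Callable[[str, str], bool]


def consistency_rewards(responses: Sequence[str], judge: Judge) -> List[float]:
    """Score each response by its share of consistent pairwise wins.

    Each pair is judged in both orders. A verdict counts only when the two
    orderings agree on the winner; a judge that always favors one position
    contradicts itself and contributes no reward.
    """
    n = len(responses)
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        forward = judge(responses[i], responses[j])   # is i preferred over j?
        backward = judge(responses[j], responses[i])  # is j preferred over i?
        if forward == backward:
            # Both orderings picked the same slot: inconsistent, so skip.
            continue
        winner = i if forward else j
        wins[winner] += 1.0
    # Normalize by the number of opponents each response faced.
    return [w / (n - 1) if n > 1 else 0.0 for w in wins]
```

With a toy judge that prefers the longer string, `consistency_rewards(["a", "bbb", "cc"], lambda a, b: len(a) > len(b))` yields `[0.0, 1.0, 0.5]`. A judge that always prefers whichever response comes first earns every response a reward of zero, which is the point of the design: a single self-endorsement carries no signal unless it survives an independent second look.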

Sources (9 notes)

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Papers this line draws on (8)

The research behind the notes this line reads, ranked by how closely each paper relates.

Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate (2.49 match)
Post-Completion Learning for Language Models (2.48 match)
When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models (2.44 match)
Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration (2.42 match)
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future (1.77 match)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (1.73 match)
Self-Rewarding Language Models (1.71 match)
Self-Questioning Language Models (1.68 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about why external critique improves revision while internal self-assessment fails in LLMs. The question remains open: what are the *actual* constraints on self-improvement now?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–08/2025. Key constraints identified:
- Self-revision in strong reasoning models (QwQ, R1, LIMO) preserves wrong answers; longer chains correlate with *lower* accuracy (~2024–2025).
- "Degeneration of thought": a model checking its work against its own prior reasoning converges toward false confidence instead of correction (~2024).
- Pure self-improvement hits a wall on the generation–verification gap; verification ability ≠ generation ability (~2024–2025).
- External critique during training counteracts "tail narrowing" and maintains solution diversity (~2024–2025).
- Training to *critique* noisy responses produces deeper understanding than training to imitate correct ones; imitation captures confident style without closing capability gaps (~2023–2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.15717 (2023-05): The False Promise of Imitating Proprietary LLMs
- arXiv:2404.09129 (2024-04): When Hindsight is Not 20/20: Testing Limits on Reflective Thinking
- arXiv:2501.17703 (2025-01): Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
- arXiv:2507.20252 (2025-07): Post-Completion Learning for Language Models

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, probe whether newer inference methods (chain-of-thought variants, test-time scaling, diffusion-based reasoning), training regimes (RL with richer reward shaping, mixture-of-experts, constitutional AI variants), or evaluation protocols have since *relaxed* the generation–verification gap or made self-revision more competitive with external signals. Separate the durable question (likely: does independence of the signal source matter?) from the perishable limitation (possibly: does self-revision *always* fail with current training?). Cite what resolved or reconfirmed each constraint.
(2) Surface the strongest *disagreement* or *superseding* work from the last ~6 months—especially any showing self-improvement *without* external anchors, or external critique that *failed* to improve revision, or tension between the critique-training paradigm and recent RL scaling results.
(3) Propose 2 research questions that assume the regime may have moved: e.g., does test-time diffusion (arXiv:2507.16075) provide a learned, model-internal independence that mimics external signal? Can temporal self-rewarding (arXiv:2508.06026) decouple internal critique from self-reinforcement?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
