INQUIRING LINE

When an AI checks its own work, it tends to grow more confident in its mistakes rather than fix them.

Does external critique guide revision better than internal self-assessment during model training?

This explores whether feedback from a separate critic outperforms a model checking its own work — and the corpus suggests the real dividing line isn't 'external vs. internal' but whether the revision signal is independent of the same uncertain reasoning that produced the error.

The most direct answer in the corpus is yes. A model that revises in response to an external critic gets more accurate, while a model that revises its own uncertain output usually just doubles down, amplifying confidence in wrong answers rather than fixing them (see "Does revising your own reasoning actually help or hurt?"). The revision *act* is neutral; the *source* of the critique decides the outcome.

Why does self-assessment curdle? Because a model reconsidering its own previous reasoning is reasoning in a closed loop: the same distribution that generated the mistake is now judging the mistake. This shows up as 'degeneration of thought,' where single-model self-revision hardens errors, while debate among genuinely *different* models reverses the pattern and improves both accuracy and calibration (see "Does a model improve by arguing with itself?"). It also shows up at the limit: pure self-improvement stalls on the generation-verification gap, diversity collapse, and reward hacking, and the methods that actually keep working turn out to be quietly smuggling in external anchors such as past model versions, third-party judges, user corrections, and tool feedback (see "Can models reliably improve themselves without external feedback?"). Even much of what looks like self-correction is theater: across eight reasoning models, reflection rarely changes the answer and mostly serves as post-hoc confirmation (see "Is reflection in reasoning models actually fixing mistakes?").
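The closed-loop point can be made concrete with a toy majority-vote calculation. This is a standard binomial argument, not something the cited notes compute, and every name and number in it is an illustrative assumption: each judge is right with probability `p`, independent judges err independently, and a model re-checking itself is treated as the fully correlated limit in which every re-check repeats the same verdict.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent judges is right.

    Toy assumption: each judge is right with probability p, independently
    of the others. n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def self_recheck_correct(p: float, n: int) -> float:
    """Fully correlated limit: n re-checks by the same judge repeat one verdict,
    so the chance of being right stays at p no matter how many passes are made."""
    return p


if __name__ == "__main__":
    p = 0.6  # hypothetical per-judge accuracy
    for n in (1, 3, 5):
        print(n, round(majority_correct(p, n), 3), self_recheck_correct(p, n))
```

Under these assumptions, three independent judges at `p = 0.6` are right about 0.648 of the time and five about 0.683, while any number of self-rechecks stays at 0.6. The same arithmetic cuts the other way when `p` is below 0.5, so independence only helps if each judge is better than chance.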

The more interesting move is to ask *why* external critique helps during training specifically, not just at test time. Step-level critique inside the training loop counteracts tail-narrowing: it keeps the model exploring diverse solutions across self-training rounds instead of converging prematurely, a benefit more fundamental than test-time accuracy gains (see "Do critique models improve diversity during training itself?"). Training a model to critique noisy, wrong responses also builds deeper understanding than training it to imitate correct ones, because engaging with failure modes forces structural reasoning that surface-pattern imitation never touches (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). Imitation alone captures style without closing any capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?").

But here's the twist that keeps this from being 'external always wins': internal self-assessment isn't doomed. It fails when it's *circular* and succeeds when it's *grounded in an independent signal*. Self-Examining RL has a model alternate between answering and judging, deriving rewards from ranking consistency rather than from trusting its own confidence, and lifts performance without any external reward (see "Can models learn to judge themselves without external rewards?"). Post-completion learning trains a model to compute its own reward in the unused sequence space after its output, internalizing evaluation at zero inference cost (see "Can models learn to evaluate their own work during training?"). The thread connecting these to the failures: introspection that merely echoes the training distribution is unreliable, but self-reports tied to a genuine causal chain can carry real signal (see "Can language models actually introspect about their own states?").
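As a rough illustration of a reward that comes from ranking consistency rather than self-reported confidence, here is a minimal sketch. It is not the Self-Examining RL authors' implementation: the function name `consistency_rewards`, the `judge(a, b)` callable (assumed to return `True` when its first argument is preferred), and the scoring rule are all hypothetical. The one idea it borrows is that a pairwise verdict only counts if it survives swapping the presentation order.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def consistency_rewards(
    responses: List[str],
    judge: Callable[[str, str], bool],
) -> Tuple[List[float], float]:
    """Score responses by pairwise wins that are stable under order swapping.

    judge(a, b) is a hypothetical callable returning True when `a` is
    preferred over `b`. A comparison counts only if the same response wins
    in both presentation orders; otherwise it is treated as noise.
    Returns (per-response rewards in [0, 1], fraction of consistent pairs).
    """
    n = len(responses)
    wins = [0] * n
    pairs = list(combinations(range(n), 2))
    consistent = 0
    for i, j in pairs:
        i_wins_forward = judge(responses[i], responses[j])   # i shown first
        j_wins_swapped = judge(responses[j], responses[i])   # j shown first
        if i_wins_forward != j_wins_swapped:
            # Same winner in both orders: keep this verdict.
            consistent += 1
            wins[i if i_wins_forward else j] += 1
    denom = max(n - 1, 1)  # each response takes part in n - 1 comparisons
    rewards = [w / denom for w in wins]
    consistency = consistent / len(pairs) if pairs else 1.0
    return rewards, consistency


if __name__ == "__main__":
    # Toy judge that prefers the longer string; purely illustrative.
    prefers_longer = lambda a, b: len(a) > len(b)
    print(consistency_rewards(["a", "bbb", "cc"], prefers_longer))
```

A judge that always prefers whichever response it sees first produces no consistent pairs here, so every reward is zero. That is the intended behavior of the sketch: a verdict that merely echoes presentation order carries no training signal.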

So the takeaway, which the question doesn't quite anticipate, is that 'external vs. internal' is the wrong axis. What separates revision that helps from revision that harms is whether the feedback breaks out of the model's own confidence loop (via a different judge, a consistency constraint, or a causal anchor) or just relabels the original guess. A model can critique itself well, but only if it stops trusting itself to be right.

Sources (10 notes)

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached a 59.90% win rate without external signals, up from 52.37%.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models (match 3.26)
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate (match 2.49)
Post-Completion Learning for Language Models (match 2.49)
Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration (match 2.42)
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future (match 1.77)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (match 1.73)
Self-Rewarding Language Models (match 1.71)
Self-Questioning Language Models (match 1.68)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capabilities researcher re-testing claims about revision guidance in LLM training. The question remains: does external critique guide revision better than internal self-assessment? Here is what a curated library (spanning 2023–2025) found — and treat these as dated, possibly superseded claims:

**What the library found — and when:**
- External critique reverses self-revision failure: when models revise their own uncertain outputs, they amplify confidence in wrong answers; external critics prevent this (2024–2025).
- Single-model self-revision produces "degeneration of thought"; multi-model debate and third-party judges preserve calibration (2024).
- Reflection in reasoning models rarely changes final answers—it is mostly post-hoc confirmation theater (2024).
- Step-level critique *during training* counters tail-narrowing and preserves exploration diversity, a benefit deeper than test-time accuracy (2024–2025).
- Training to critique noisy responses builds deeper reasoning than imitation alone, which captures style without closing capability gaps (2025).
- Internal self-assessment succeeds when grounded in an *independent signal* (consistency ranking, causal anchors, post-completion rewards) rather than confidence circularity (2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2404.09129 (Apr 2024): Reflective thinking limits in LLMs.
- arXiv:2501.17703 (Jan 2025): Critique fine-tuning vs. imitation.
- arXiv:2507.20252 (Jul 2025): Post-completion learning for self-evaluation.
- arXiv:2508.06026 (Aug 2025): Temporal self-rewarding—decoupling chosen–rejected via past–future.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, probe whether newer models (o1-class reasoning, multimodal), scaled critique training, or orchestration (multi-agent debate, tool-grounded feedback loops) have *relaxed* the circular-reasoning trap. Separate the durable question (does feedback source matter?) from the perishable limitation (does self-assessment *always* fail?). Cite what resolved it; plainly flag where the constraint still holds.
(2) **Surface contradicting or superseding work from the last ~6 months.** Especially prize papers showing self-assessment *does* work when the model is trained to avoid confidence loops, or external critique *fails* under specific distributional conditions.
(3) **Propose 2 research questions that assume the regime may have moved:** (a) Can a model learn to self-critique by alternating between role-playing as a critic *and* reasoner, without external supervision? (b) Does test-time critique from an external model transfer to train-time self-critique if the model internalizes the critic's causal model?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
