Does external critique guide revision better than internal self-assessment during model training?
This explores whether feedback from a separate critic outperforms a model checking its own work — and the corpus suggests the real dividing line isn't 'external vs. internal' but whether the revision signal is independent of the same uncertain reasoning that produced the error.
The cleanest answer in the corpus is yes: when a model revises based on an external critic it gets more accurate, but when it revises its own uncertain output it usually just doubles down, amplifying confidence in wrong answers rather than fixing them (see "Does revising your own reasoning actually help or hurt?"). The revision *act* is neutral; the *source* of the critique decides the outcome.
Why does self-assessment curdle? Because a model reconsidering its own previous reasoning is reasoning in a closed loop: the same distribution that generated the mistake is now judging the mistake. This shows up as 'degeneration of thought,' where single-model self-revision hardens errors, while debate among genuinely *different* models reverses the pattern and improves both accuracy and calibration (see "Does a model improve by arguing with itself?"). It also shows up at the limit: pure self-improvement stalls on the generation-verification gap, diversity collapse, and reward hacking, and the methods that actually keep working turn out to be quietly smuggling in external anchors such as past model versions, third-party judges, user corrections, and tool feedback (see "Can models reliably improve themselves without external feedback?"). Even much of what looks like self-correction is theater: across eight reasoning models, reflection rarely changes the answer and mostly serves as post-hoc confirmation (see "Is reflection in reasoning models actually fixing mistakes?").
The more interesting move is to ask *why* external critique helps during training specifically, not just at test time. Step-level critique inside the training loop counteracts tail narrowing: it keeps the model exploring diverse solutions across self-training rounds instead of prematurely converging, a benefit more fundamental than test-time accuracy gains (see "Do critique models improve diversity during training itself?"). And training a model to critique noisy, wrong responses builds deeper understanding than training it to imitate correct ones, because engaging with failure modes forces structural reasoning that surface-pattern imitation never touches (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). Imitation alone captures style without closing any capability gap (see "Can imitating ChatGPT fool evaluators into thinking models improved?").
But here's the twist that keeps this from being 'external always wins': internal self-assessment isn't doomed. It fails when it's *circular* and succeeds when it's *grounded in an independent signal*. Self-Examining RL (SERL) has a model alternate between answering and judging its answers pairwise, deriving rewards from ranking consistency rather than from trusting its own confidence, and lifts its AlpacaEval win rate from 52.37% to 59.90% without any external reward (see "Can models learn to judge themselves without external rewards?"). Post-completion learning trains a model to compute its own reward in the unused sequence space after its output, internalizing evaluation at zero inference cost (see "Can models learn to evaluate their own work during training?"). The thread connecting these to the failures: introspection that merely echoes the training distribution is unreliable, but self-reports tied to a genuine causal chain can carry real signal (see "Can language models actually introspect about their own states?").
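One way to picture a reward that comes from ranking consistency rather than from confidence is the toy sketch below. It is not SERL's published reward formula. The function name `consistency_rewards`, the `judge(a, b)` signature, and the position-swap check are all assumptions made for illustration. A pairwise verdict counts only if the judge picks the same winner when the two responses are shown in the opposite order.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def consistency_rewards(
    responses: List[str],
    judge: Callable[[str, str], int],
) -> Tuple[Dict[str, float], float]:
    """Toy ranking-consistency reward (hypothetical, not SERL's exact method).

    judge(a, b) returns 0 if it prefers its first argument, 1 if its second.
    Responses are assumed to be distinct strings.
    """
    wins = {r: 0 for r in responses}
    pairs = list(combinations(responses, 2))
    consistent = 0

    for a, b in pairs:
        # Ask the same question in both presentation orders.
        winner_ab = a if judge(a, b) == 0 else b
        winner_ba = b if judge(b, a) == 0 else a
        if winner_ab == winner_ba:
            # Only order-invariant verdicts are allowed to produce reward.
            consistent += 1
            wins[winner_ab] += 1

    # Reward for the judging role: fraction of order-invariant verdicts.
    judge_reward = consistent / len(pairs) if pairs else 0.0
    # Reward for the answering role: share of possible consistent wins.
    denom = max(len(responses) - 1, 1)
    answer_rewards = {r: wins[r] / denom for r in responses}
    return answer_rewards, judge_reward


if __name__ == "__main__":
    # Stand-in judge that prefers the longer response, regardless of position.
    prefers_longer = lambda x, y: 0 if len(x) >= len(y) else 1
    print(consistency_rewards(["a", "bbb", "cc"], prefers_longer))
```

A judge that always prefers whichever response is shown first earns no reward in either role under this scheme, which is the point: the signal comes from a constraint the judge cannot satisfy by simply being confident.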
So the takeaway the question doesn't quite anticipate: 'external vs. internal' is the wrong axis. What separates revision that helps from revision that harms is whether the feedback breaks out of the model's own confidence loop — via a different judge, a consistency constraint, or a causal anchor — or just relabels the original guess. A model can critique itself well, but only if it stops trusting itself to be right.
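The distinction can be sketched as a revision loop that accepts a rewrite only when a check independent of the reviser agrees. This is a minimal illustration, not a method from any of the notes. `gated_revision`, `propose`, and `check` are hypothetical names, and `check` stands in for whatever independent signal is available (a different judge, a consistency constraint, a tool result).

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def gated_revision(
    answer: T,
    propose: Callable[[T], T],
    check: Callable[[T], bool],
    max_rounds: int = 3,
) -> T:
    """Return a revision only if an independent check accepts it.

    propose: the model's own revision step (hypothetical stand-in).
    check:   a signal that does not depend on the model's confidence.
    """
    if check(answer):
        return answer  # already verified, nothing to revise

    candidate = answer
    for _ in range(max_rounds):
        candidate = propose(candidate)
        if check(candidate):
            return candidate  # revision confirmed by the independent signal

    # No verified revision was found: keep the original answer instead of
    # adopting a rewrite that nothing outside the model has confirmed.
    return answer


if __name__ == "__main__":
    # Toy example: the "model" lowers its guess by one each round, and the
    # independent check is an arithmetic tool that knows 2 + 3.
    result = gated_revision(7, propose=lambda x: x - 1, check=lambda x: x == 2 + 3)
    print(result)
```

The design choice worth noting is the fallback: when no candidate passes the check, the loop returns the original answer, so repeated self-revision alone can never overwrite it.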
Sources (10 notes)
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.