Why do models trained on critique fail at self-critique despite strong other-model evaluation?
This explores a specific asymmetry: a model can be sharp at judging another model's work, yet stumble when turning that same critical eye on its own output — and asks why the skill doesn't transfer inward.
The corpus points to one underlying culprit: the model isn't missing the skill, it's fighting a bias toward trusting whatever it generated. One study found that LLMs systematically over-trust their own answers because a high-probability generated answer simply *feels* more correct during evaluation, a self-agreement loop that has nothing to do with whether the answer is right (see "Why do models trust their own generated answers?"). The critique competence is still present; it just gets overridden by the model's prior commitment to its own text.
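One way to picture breaking the self-agreement loop, following the source note's remark that comparing an answer against broader alternatives helps, is to hide which candidate is the model's own before asking for a judgment. The sketch below is a minimal illustration, not the procedure from the cited study: `blind_candidates` and `build_prompt` are hypothetical helper names, and the prompt wording is an assumption.

```python
import random
from typing import List, Sequence, Tuple


def blind_candidates(
    own_answer: str, alternatives: Sequence[str], seed: int = 0
) -> Tuple[List[str], int]:
    """Shuffle the model's own answer in with alternatives.

    Returns the shuffled candidates and the position of the model's own
    answer, so the caller can check afterwards whether it picked itself.
    (Hypothetical helper, for illustration only.)
    """
    pool = [own_answer, *alternatives]
    order = list(range(len(pool)))
    random.Random(seed).shuffle(order)      # seeded, so runs are reproducible
    shuffled = [pool[i] for i in order]
    own_index = order.index(0)              # pool[0] is the model's own answer
    return shuffled, own_index


def build_prompt(question: str, candidates: Sequence[str]) -> str:
    """Format an unlabeled comparison prompt (wording is an assumption)."""
    lines = [f"Question: {question}", "Candidate answers:"]
    lines += [f"({i + 1}) {c}" for i, c in enumerate(candidates)]
    lines.append("Which candidate is best supported? Reply with its number.")
    return "\n".join(lines)
```

The design point is only that the evaluator never sees an "is my answer right?" framing: authorship is stripped, and the model's own answer has to win a comparison rather than a confirmation.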
That's why the failure compounds rather than corrects itself. When a model revises based only on its own previous reasoning, it tends to grow *more* confident in its errors, not less: a distinct failure mode in which self-revision amplifies wrong answers instead of catching them (see "Does a model improve by arguing with itself?"). The fix in that work is telling: genuine disagreement from a *different* model reverses the pattern and improves both accuracy and calibration. The variable that matters isn't "can it critique" but "is the thing being critiqued its own." Other-model evaluation works precisely because the model has no stake in the other model's output.
There's also a training-data reason the inward version breaks. Teaching self-correction by fine-tuning on offline correction traces fails because the errors in the training data don't match the errors the model actually makes at test time, and the model collapses into a single rote correction mode (see "Why does self-correction training on offline data fail?"). So even a model explicitly trained to critique-and-fix can be critiquing a distribution of mistakes it never makes: strong on paper, useless on its own live errors. The repair was multi-turn online RL on the model's *own* mistakes, which is just another way of grounding the critique in something the model can't pre-commit to.
The deeper frame is that pure self-evaluation is structurally circular. Reliable self-improvement methods that look like they run on internal signal almost always smuggle in an external anchor: a past model version, a third-party judge, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). Other-model evaluation *is* that external anchor; self-critique removes it and leaves the model marking its own homework. The methods that succeed at self-judgment work by manufacturing distance. SERL has the model alternate between author and judge and derives reward from ranking *consistency* rather than self-approval (see "Can models learn to judge themselves without external rewards?"), and in-training critique keeps solution diversity alive so the model doesn't prematurely converge on its own first guess (see "Do critique models improve diversity during training itself?").
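To make "reward from ranking consistency rather than self-approval" concrete, here is a toy sketch. It is not SERL's actual reward formula, which the notes here don't spell out. It assumes a hypothetical `judge(first, second)` callable that returns `True` when it prefers the first response shown, and it only counts a pairwise verdict when the same response wins in both presentation orders.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# Hypothetical judge signature: True if the response shown first is preferred.
Judge = Callable[[str, str], bool]


def consistency_rewards(responses: Sequence[str], judge: Judge) -> List[float]:
    """Reward each response by its share of position-consistent pairwise wins.

    A pair counts only if the same response wins whether it is shown first
    or second; verdicts that flip with position carry no reward signal.
    (Illustrative assumption, not the published SERL objective.)
    """
    wins = [0] * len(responses)
    consistent_pairs = 0
    for i, j in combinations(range(len(responses)), 2):
        i_wins_when_first = judge(responses[i], responses[j])
        j_wins_when_first = judge(responses[j], responses[i])
        if i_wins_when_first == j_wins_when_first:
            # Both orders favored the same *position*, not the same response.
            continue
        consistent_pairs += 1
        wins[i if i_wins_when_first else j] += 1
    if consistent_pairs == 0:
        return [0.0] * len(responses)
    return [w / consistent_pairs for w in wins]
```

Under this sketch, a judge that always approves whatever it sees first earns every response a reward of zero, which is the point: approval alone produces no signal, only agreement across independent presentations does.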
The thing worth carrying away: critique is genuinely a deeper skill than imitation. Training on critiquing flawed answers builds more real understanding than copying correct ones (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). The skill is real and it's learned. What doesn't transfer is *objectivity*, because objectivity was never a property of the model; it was a property of the gap between judge and judged. Close that gap and the same competent critic becomes its own most credulous fan.
Sources (7 notes)
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.