Why does self-critiquing actually reduce plan quality in language models?
This explores why turning a model loose on its own plans — having it critique and revise what it just produced — can make the output worse instead of better, and what in the corpus explains that backfire.
This reads the question as being about a specific failure: not 'does critique help' in general, but why self-critique in particular degrades a plan a model already made. The corpus points to one root cause with several faces: the critic and the author are the same model, sharing the same blind spots. The cleanest statement of the mechanism is that language models carry a structural bias toward trusting answers they generated themselves. A high-probability output 'feels' correct precisely because the model assigned it high probability, so when that same model is asked to judge it, the judgment is contaminated by the generation (see: Why do models trust their own generated answers?). Self-critique isn't a fresh pair of eyes; it's the same eyes grading their own homework.
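As a toy illustration of the self-agreement loop described above, the sketch below assumes a model can be reduced to a table of probabilities over candidate plans. The plan names, the numbers, and the `independent_scores` table are all invented for illustration and are not taken from the cited note.

```python
# Toy model of self-preference: the generator and the self-critic
# read from the same probability table. All names and numbers are hypothetical.

def generate(probs):
    """Pick the plan that the model assigns the highest probability."""
    return max(probs, key=probs.get)

def judge(scores, plan):
    """Approve a plan only if the judge ranks it above every alternative."""
    return plan == max(scores, key=scores.get)

model_probs = {"plan_a": 0.6, "plan_b": 0.3, "plan_c": 0.1}
independent_scores = {"plan_a": 0.2, "plan_b": 0.7, "plan_c": 0.1}

plan = generate(model_probs)
self_verdict = judge(model_probs, plan)             # same numbers that produced the plan
external_verdict = judge(independent_scores, plan)  # numbers from a different source

print(plan, self_verdict, external_verdict)
```

Under these assumptions the self-verdict approves by construction, whatever the plan's actual quality, while the external verdict is free to disagree.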
There's a deeper, almost formal version of this. Self-improvement in LLMs is bounded by what's called the generation-verification gap: a model can only reliably fix what it can independently verify, and metacognition alone doesn't supply that external check (see: What stops large language models from improving themselves?). When verification is no stronger than generation, a critique pass adds confident-sounding edits without adding real signal, which is exactly the regime where revisions drift away from a decent first plan. The same work argues the fix has to be *externalized* rather than learned introspectively (see: What actually constrains large language models from self-improvement?). That reframes the whole question: self-critique reduces quality because it pretends to be the external check it structurally cannot be.
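One way to make the argument above concrete is a back-of-the-envelope model of a single critique-and-revise pass. This is a minimal sketch under stated assumptions, not a formalization from the cited notes: the function name, its parameters, and every probability below are hypothetical. It assumes a flagged plan is rewritten from scratch and an unflagged plan is kept as is.

```python
def quality_after_revision(p_first, p_revised, flag_if_correct, flag_if_wrong):
    """Probability the final plan is good after one critique-and-revise pass.

    p_first:         chance the first plan is good
    p_revised:       chance a rewritten plan is good
    flag_if_correct: chance the critic flags a good plan for rewriting
    flag_if_wrong:   chance the critic flags a bad plan for rewriting
    """
    kept_good = p_first * (1 - flag_if_correct)
    flagged = p_first * flag_if_correct + (1 - p_first) * flag_if_wrong
    return kept_good + flagged * p_revised

baseline = 0.7  # assumed quality of the first plan with no critique at all

# Critic flags good and bad plans at the same rate: no signal.
uninformative = quality_after_revision(0.7, 0.7, 0.3, 0.3)
# Same critic, but rewrites are slightly worse than first drafts.
drifting = quality_after_revision(0.7, 0.6, 0.3, 0.3)
# Critic that mostly flags bad plans, as an independent verifier might.
independent = quality_after_revision(0.7, 0.7, 0.05, 0.8)

print(round(uninformative, 4), round(drifting, 4), round(independent, 4))
```

With these made-up numbers the uninformative critic leaves quality at roughly 0.70, the drifting case drops to roughly 0.67, and the informative verifier lifts it to roughly 0.86. The point of the sketch is only the direction of each change: revision helps when flagging correlates with actual errors, and hurts when it does not and rewrites are no better than first drafts.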
Introspection is the other half of the problem. When a model 'explains why' a plan is weak, it's usually not reading its own internal process; its self-reports mostly echo patterns in the training data rather than genuine inspection of what it actually did (see: Can language models actually introspect about their own states?). So the critique is plausible narrative, not diagnosis, and acting on a plausible-but-ungrounded critique is how a sound plan gets 'corrected' into a worse one. Two adjacent failure modes make this concrete: models lock into premature assumptions early and can't recover from them later in a conversation (see: Why do language models fail in gradually revealed conversations?), and they exhibit face-saving avoidance, declining to flatly contradict a claim even when they know better (see: Why do language models avoid correcting false user claims?). A self-critic inherits both: it tends to rationalize its initial commitments and to soften the very corrections that would help.
The most useful turn here is what the corpus says *does* work, because it tells you why naive self-critique doesn't. Training a model to correct itself from its own offline 'here's the fix' traces fails: the errors it sees in training don't match the errors it makes at test time, and it collapses into one stock correction move. What works is online reinforcement learning under the model's *actual* error distribution, letting it practice fixing real mistakes (see: Why does self-correction training on offline data fail?). The other working pattern is to break the self-agreement loop with genuine externality: an asymmetric proposer/solver setup where one part proposes problems and the other solves them, with majority-vote verification standing in for ground-truth answers (see: Can language models improve themselves without any external training data?), or post-completion training that builds a separate evaluation pass into the model rather than bolting critique on at inference (see: Can models learn to evaluate their own work during training?).
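The majority-vote idea mentioned above can be sketched in a few lines. This is an illustrative assumption about how such a check might look, not the cited framework's actual implementation: `majority_vote`, `vote_rewards`, and the sample answers are hypothetical names and data.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

def vote_rewards(answers):
    """Reward 1.0 for each sample matching the majority answer, else 0.0."""
    winner, _ = majority_vote(answers)
    return [1.0 if a == winner else 0.0 for a in answers]

# Hypothetical answers sampled from a solver for one proposed problem.
samples = ["42", "42", "41", "42", "17"]
winner, agreement = majority_vote(samples)
rewards = vote_rewards(samples)

print(winner, agreement, rewards)
```

The design choice worth noting is that the reward comes from agreement across several independent samples rather than from one pass grading itself. Agreement is still only a proxy for correctness: a consistently wrong solver would receive full reward, so the agreement fraction is best read as a confidence signal.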
The thing worth walking away with: self-critique doesn't fail because models are bad at criticism. It fails because asking a model to critique itself violates the one condition under which critique improves anything, namely that the verifier be independent of the generator. The collection's own self-knowledge thread shows models *do* have real, causal mechanisms for tracking what they don't know (see: Do models know what they don't know?). So the answer isn't 'models can't self-assess at all'; it's that bolt-on self-critique routes around those mechanisms and leans on the biased, narrative-generating part instead. Build in the externality and self-evaluation helps; skip it and you get confident revision toward worse plans.
Sources (10 notes)
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.