INQUIRING LINE

Having an AI critique its own plan is like grading your own test — the same blind spots do both jobs.

Why does self-critiquing actually reduce plan quality in language models?

This explores why turning a model loose on its own plans — having it critique and revise what it just produced — can make the output worse instead of better, and what in the corpus explains that backfire.

This reads the question as being about a specific failure: not 'does critique help' in general, but why self-critique in particular degrades a plan a model already made. The corpus points to one root cause with several faces: the critic and the author are the same model, sharing the same blind spots. The cleanest statement of the mechanism is that language models carry a structural bias toward trusting answers they generated themselves. A high-probability output 'feels' correct precisely because the model assigned it high probability, so when that same model is asked to judge it, the judgment is contaminated by the generation (see 'Why do models trust their own generated answers?'). Self-critique isn't a fresh pair of eyes; it's the same eyes grading their own homework.

There's a deeper, almost formal version of this. Self-improvement in LLMs is bounded by what's called the generation-verification gap: a model can only reliably fix what it can independently verify, and metacognition alone doesn't supply that external check (see 'What stops large language models from improving themselves?'). When verification is no stronger than generation, a critique pass adds confident-sounding edits without adding real signal, which is exactly the regime where revisions drift away from a decent first plan. The same work argues the fix has to be *externalized* rather than learned introspectively (see 'What actually constrains large language models from self-improvement?'), which reframes the whole question: self-critique reduces quality because it pretends to be the external check it structurally cannot be.
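To make the 'no stronger than generation' regime concrete, here is a toy calculation. It is not taken from the cited work; it is a minimal sketch under stated assumptions: a first plan is good with probability `p_good`, the critic flags bad plans at rate `tpr` and good plans at rate `fpr`, and any flagged plan is replaced by a revision that is good with probability `q_revised`. All four names are hypothetical parameters introduced for illustration.

```python
def quality_after_critique(p_good: float, tpr: float, fpr: float, q_revised: float) -> float:
    """Probability the final plan is good after one critique-and-revise pass.

    Toy model (an assumption, not the cited papers' formalism):
      - unflagged plans are kept as they are,
      - flagged plans are replaced by a revision that is good with prob q_revised.
    """
    kept_good = p_good * (1.0 - fpr)                 # good plans the critic leaves alone
    flagged = p_good * fpr + (1.0 - p_good) * tpr    # every plan the critic sends to revision
    return kept_good + flagged * q_revised


# A decent first plan (80% good) with revisions that are somewhat worse (60% good).
uninformative = quality_after_critique(p_good=0.8, tpr=0.5, fpr=0.5, q_revised=0.6)
informative = quality_after_critique(p_good=0.8, tpr=0.9, fpr=0.1, q_revised=0.6)

print(f"critic with no signal:   {uninformative:.3f}")  # 0.700, below the 0.800 baseline
print(f"critic with real signal: {informative:.3f}")    # 0.876, above the baseline
```

In this sketch, a critic whose flags are unrelated to plan quality (`tpr == fpr`) can only help if revisions are better than first drafts. With a decent first plan and weaker revisions, the pass lowers quality, while the same revision step paired with a verifier that actually separates good plans from bad ones raises it.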

Introspection is the other shoe. When a model 'explains why' a plan is weak, it's usually not reading its own internal process; its self-reports mostly echo patterns in the training data rather than genuine inspection of what it actually did (see 'Can language models actually introspect about their own states?'). So the critique is plausible narrative, not diagnosis, and acting on a plausible-but-ungrounded critique is how a sound plan gets 'corrected' into a worse one. Two adjacent failure modes make this concrete: models lock into premature assumptions early and can't recover from them later in a conversation (see 'Why do language models fail in gradually revealed conversations?'), and they exhibit face-saving avoidance, declining to flatly contradict a claim even when they know better (see 'Why do language models avoid correcting false user claims?'). A self-critic inherits both: it tends to rationalize its initial commitments and to soften the very corrections that would help.

The most useful turn here is what the corpus says *does* work, because it tells you why naive self-critique doesn't. Training a model to correct itself from its own offline 'here's the fix' traces fails: the errors it sees in training don't match the errors it makes at test time, and it collapses into one stock correction move. What works is online reinforcement learning under the model's *actual* error distribution, letting it practice fixing real mistakes (see 'Why does self-correction training on offline data fail?'). The other working pattern is to break the self-agreement loop with genuine externality: an asymmetric proposer/solver setup where one part generates problems and another verifies by majority vote (see 'Can language models improve themselves without any external training data?'), or post-completion training that builds a separate evaluation pass into the model rather than bolting critique on at inference (see 'Can models learn to evaluate their own work during training?').
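Majority-vote verification is simple enough to sketch. The snippet below is an illustration only, not the SQLM implementation: the function names are hypothetical, and the hard-coded strings stand in for several independently sampled solver answers to one proposed problem. The most common answer is treated as a pseudo-label, and each sample is rewarded for agreeing with it.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    winner, votes = Counter(answers).most_common(1)[0]  # ties resolve to the first-seen answer
    return winner, votes / len(answers)


def pseudo_rewards(answers: List[str]) -> List[float]:
    """Reward 1.0 for samples that match the majority answer, 0.0 otherwise."""
    winner, _ = majority_vote(answers)
    return [1.0 if answer == winner else 0.0 for answer in answers]


# Hypothetical solver samples for a single problem.
samples = ["42", "42", "41", "42"]
print(majority_vote(samples))   # ('42', 0.75)
print(pseudo_rewards(samples))  # [1.0, 1.0, 0.0, 1.0]
```

The point of the design is that no single sample gets to grade itself: the check comes from agreement across many samples. The agreement fraction can also serve as a rough confidence signal, though how the cited work uses it is not something this sketch claims to reproduce.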

The thing worth walking away with: self-critique doesn't fail because models are bad at criticism. It fails because asking a model to critique itself violates the one condition under which critique improves anything, namely that the verifier be independent of the generator. The collection's own self-knowledge thread shows models *do* have real, causal mechanisms for tracking what they don't know (see 'Do models know what they don't know?'), so the answer isn't 'models can't self-assess at all,' it's that bolt-on self-critique routes around those mechanisms and leans on the biased, narrative-generating part instead. Build in the externality and self-evaluation helps; skip it and you get confident revision toward worse plans.
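One way to picture 'building in the externality' is a gate that accepts a revision only when a check outside the model prefers it. This is a sketch under assumptions, not a method from the corpus: `revise` and `score` are hypothetical callables, and the step-counting scorer is a stand-in for any independent verifier such as tests, a constraint checker, or a separate model.

```python
from typing import Callable

REQUIRED_STEPS = ("book venue", "send invites", "confirm catering")


def revise_with_external_check(
    plan: str,
    revise: Callable[[str], str],
    score: Callable[[str], float],
) -> str:
    """Keep the revision only if the external scorer rates it strictly higher."""
    candidate = revise(plan)
    return candidate if score(candidate) > score(plan) else plan


def count_required_steps(plan: str) -> float:
    """Toy external verifier: how many required steps the plan mentions."""
    return float(sum(step in plan for step in REQUIRED_STEPS))


def overconfident_revision(plan: str) -> str:
    """Stand-in for a self-critique that 'simplifies' a sound plan by dropping a step."""
    return plan.replace("; confirm catering", "")


first_plan = "book venue; send invites; confirm catering"
final_plan = revise_with_external_check(first_plan, overconfident_revision, count_required_steps)
print(final_plan)  # book venue; send invites; confirm catering
```

Here the degraded revision is rejected because the scorer does not share the reviser's blind spot. The gate is only as good as that scorer, which is the whole argument in miniature.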

Sources (10 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models (2.57 match)
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing (2.56 match)
SPICE: Self-Play In Corpus Environments Improves Reasoning (2.56 match)
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future (2.56 match)
When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models (2.46 match)
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models (1.76 match)
Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation (1.76 match)
Self-Improving Model Steering (1.74 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about self-critique degradation in language models. The question remains: under what conditions does self-critique actually harm plan quality, and have those conditions shifted?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2025; treat all as time-bound:
• Models exhibit structural self-agreement bias: high-probability outputs feel correct to the same model that generated them, contaminating critique (2024-03).
• The generation-verification gap prevents self-improvement unless externalized; introspective learning fails because models cannot independently verify their own flaws (2024-12).
• Self-reports in critiques mostly echo training data patterns, not genuine introspection; models rationalize initial commitments and avoid contradicting themselves (2025-06).
• SFT on model-generated correction traces fails due to train-test distribution mismatch; online RL under actual error distribution works (2024-09).
• Breaking the self-agreement loop via asymmetric proposer/solver setups or post-completion evaluation training succeeds where bolt-on critique fails (2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2403.09972 (2024-03) — self-detection and structural bias
• arXiv:2412.02674 (2024-12) — generation-verification gap
• arXiv:2506.05068 (2025-06) — introspection as narrative generation
• arXiv:2507.20252 (2025-07) — post-completion learning

Your task:
(1) RE-TEST each constraint. Has the self-agreement bias been experimentally relaxed by newer model scaling, separate encoder/critic architectures, or adversarial training? Do recent evals (2025-08 onwards) show online RL or externalized critique actually solving the distribution-mismatch problem in production? Where does introspection still fail?
(2) Surface the strongest contradicting work: any papers showing self-critique *does* improve plans under specific conditions, or showing the generation-verification gap is narrower than claimed.
(3) Propose two questions: (a) Can a unified self-critique mechanism emerge if models are trained to *track uncertainty* in their own planning, not just revise? (b) Does modular fine-tuning of a separate critique head (rather than inference-time prompting) dissolve the bias problem?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
