Can language models accurately evaluate the quality of their own ideas?
This explores whether an LLM can judge the quality of its own outputs — and the corpus is fairly blunt: self-evaluation is systematically biased, and reliable judgment seems to need something external.
The question here is not whether a language model *produces* good ideas, but whether it can *tell* which of its own outputs are good. The corpus leans toward a clear answer: not reliably, and the failure is structural rather than incidental. The most direct evidence is that models over-trust what they themselves generated: a high-probability answer simply *feels* more correct when the same model grades it, creating a self-agreement loop that breaks only when the answer is compared against broader alternatives (see: Why do models trust their own generated answers?). So an LLM grading its own idea is partly grading its own fluency.
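One way to picture "comparing against broader alternatives" is a small pairwise ranking loop. The sketch below is only an illustration, not a method taken from the notes: `rank_by_comparison`, the `Judge` signature, and `toy_judge` are hypothetical names, and the toy judge (which simply prefers longer text) stands in for whatever external or differently biased evaluator would be used in practice. Candidates are assumed to be distinct strings.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical judge signature: given two candidate answers, return the preferred one.
Judge = Callable[[str, str], str]


def rank_by_comparison(candidates: List[str], judge: Judge) -> Dict[str, int]:
    """Count pairwise wins for each candidate.

    Each pair is judged in both orders, so a judge that favors whichever
    answer it sees first cannot decide the outcome on position alone.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        for first, second in ((a, b), (b, a)):
            winner = judge(first, second)
            if winner in wins:  # ignore malformed judge output
                wins[winner] += 1
    return wins


def toy_judge(x: str, y: str) -> str:
    """Stand-in judge for demonstration only: prefers the longer answer."""
    return x if len(x) >= len(y) else y


# The model's own draft is scored alongside alternatives instead of in isolation.
scores = rank_by_comparison(
    ["own draft", "alternative from another source", "short"],
    toy_judge,
)
best = max(scores, key=scores.get)
```

The design point is that the model's own answer never gets graded alone: it has to win comparisons against candidates it did not itself rate as most probable.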
There's a deeper formal limit underneath that bias. Self-improvement is bounded by a generation–verification gap: a model can generate, but every reliable correction needs something outside the model to validate and enforce it, and metacognition alone can't close that gap (see: What stops large language models from improving themselves?). This connects to a surprising finding about self-knowledge: models can describe their own learned behaviors without being trained to, yet those self-reports are unstable, shift under conversational pressure, and don't reflect genuine self-understanding (see: How well do language models understand their own knowledge?). If a model doesn't have stable access to *what it knows*, asking it to accurately rate *how good its idea is* inherits the same wobble.
Why is introspective evaluation so shaky? A few notes point at the machinery. Reasoning traces turn out to be persuasive performance rather than verified computation: invalid logical steps score about as well as valid ones, so the 'explanation' a model gives for why an idea is good isn't actually the thing producing the answer (see: Do reasoning traces show how models actually think?). Generation itself flows smoothly toward the training distribution rather than exploring competing claims, so a model isn't naturally weighing its idea against the strongest counterposition while producing it (see: Does LLM generation explore competing claims while producing text?). And 'Potemkin understanding' shows explanation and application can be functionally disconnected: a model can correctly explain a concept, fail to apply it, and even recognize the failure, which means self-assessment and actual competence run on separate tracks (see: Can LLMs understand concepts they cannot apply?).
The more interesting turn is what *does* work. The corpus suggests self-evaluation becomes reliable when you stop relying on a single model judging itself in the moment and instead build in an external or adversarial check. Asymmetric self-play replaces self-grading with a proposer–solver split and majority-vote verification, letting models improve with no human labels precisely because verification is structurally separated from generation (see: Can language models improve themselves without any external training data?). Post-Completion Learning trains a model to compute its own reward in unused sequence space, internalizing evaluation during training rather than trusting in-the-moment confidence at inference (see: Can models learn to evaluate their own work during training?). The common thread: the fix isn't 'try harder to introspect,' it's 'engineer a verification step that doesn't share the generator's biases.'
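The majority-vote verification mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SQLM implementation: it assumes the solver's sampled answers have already been normalized to comparable strings and that the reward is binary. The function names `majority_vote` and `solver_rewards` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples agreeing with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def solver_rewards(answers: List[str]) -> List[float]:
    """Assumed binary reward: 1.0 if a sample matches the consensus, else 0.0."""
    consensus, _ = majority_vote(answers)
    return [1.0 if a == consensus else 0.0 for a in answers]


# Five independent solver samples for one proposed problem (made-up values).
samples = ["42", "42", "41", "42", "7"]
consensus, agreement = majority_vote(samples)
rewards = solver_rewards(samples)
```

Here no single sample grades itself: each one is checked against agreement across many samples. The consensus is still only a proxy for correctness, since a majority can be confidently wrong, so a low `agreement` value is a reasonable signal to treat the label as unreliable.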
So the thing you might not have known you wanted to know: an LLM's confidence in its own idea is partly a measure of how *probable* that idea was to generate, not how *good* it is. That is why comparison, adversarial framing, and externalized verification consistently beat a model asked to grade its own work (see: Why do models trust their own generated answers?; What stops large language models from improving themselves?).
Sources (8 notes)
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.
LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.