INQUIRING LINE

Can AI tell which of its own ideas are good, or does it just trust whatever it wrote?

Can language models accurately evaluate the quality of their own ideas?

This explores whether an LLM can judge the quality of its own outputs — and the corpus is fairly blunt: self-evaluation is systematically biased, and reliable judgment seems to need something external.

This explores whether a language model can accurately rate the quality of its own ideas: not whether it *produces* good ideas, but whether it can *tell* which of its own outputs are good. The corpus leans toward a clear answer: not reliably, and the failure is structural rather than incidental. The most direct evidence is that models over-trust what they themselves generated. A high-probability answer simply *feels* more correct when the same model grades it, creating a self-agreement loop that breaks only when the answer is compared against broader alternatives (Why do models trust their own generated answers?). So an LLM grading its own idea is partly grading its own fluency.
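A toy sketch of the self-agreement loop described above, under loudly stated assumptions: the candidate answers, their generation probabilities, and the correctness flags are invented for illustration, and `self_grade` simply assumes that a model's self-assigned score tracks generation probability. It is not a measurement of any real model.

```python
# Toy candidates: (answer, generation probability, is_correct).
# All values are made up for illustration.
candidates = [
    ("fluent but wrong", 0.60, False),
    ("awkward but right", 0.25, True),
    ("rare and wrong", 0.15, False),
]

def self_grade(candidate):
    """Assumed self-grader: the score is just the generation probability."""
    _, prob, _ = candidate
    return prob

def external_check(candidate):
    """Assumed external verifier: sees ground truth the self-grader lacks."""
    _, _, is_correct = candidate
    return 1.0 if is_correct else 0.0

# The self-grader picks whatever was most probable to generate.
self_pick = max(candidates, key=self_grade)[0]
# The external check picks by correctness, independent of probability.
verified_pick = max(candidates, key=external_check)[0]

print(self_pick)      # fluent but wrong
print(verified_pick)  # awkward but right
```

Under these assumptions the self-graded pick is fixed by probability alone, so it changes only when a signal that does not depend on the generator's own distribution enters the comparison.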

There's a deeper formal limit underneath that bias. Self-improvement is bounded by a generation–verification gap: a model can generate, but every reliable correction needs something outside the model to validate and enforce it, and metacognition alone can't close that gap (What stops large language models from improving themselves?). This connects to a surprising finding about self-knowledge: models can describe their own learned behaviors without being trained to, yet those self-reports are unstable, shift under conversational pressure, and don't reflect genuine self-understanding (How well do language models understand their own knowledge?). If a model doesn't have stable access to *what it knows*, then its rating of *how good its idea is* inherits the same wobble.

Why is introspective evaluation so shaky? A few notes point at the machinery. Reasoning traces turn out to be persuasive performance rather than verified computation: invalid logical steps score about as well as valid ones, so the 'explanation' a model gives for why an idea is good isn't actually the thing producing the answer (Do reasoning traces show how models actually think?). Generation itself flows smoothly toward the training distribution rather than exploring competing claims, so a model isn't naturally weighing its idea against the strongest counterposition while producing it (Does LLM generation explore competing claims while producing text?). And 'Potemkin understanding' shows that explanation and application can be functionally disconnected: a model can correctly explain a concept, fail to apply it, and even recognize the failure, which means self-assessment and actual competence run on separate tracks (Can LLMs understand concepts they cannot apply?).

The more interesting turn is what *does* work. The corpus suggests self-evaluation becomes reliable when you stop relying on a single model judging itself in the moment and instead build in an external or adversarial check. Asymmetric self-play replaces self-grading with a proposer–solver split and majority-vote verification, letting models improve with no human labels precisely because verification is structurally separated from generation (Can language models improve themselves without any external training data?). Post-Completion Learning trains a model to compute its own reward in unused sequence space, internalizing evaluation during training rather than trusting in-the-moment confidence at inference (Can models learn to evaluate their own work during training?). The common thread: the fix isn't 'try harder to introspect,' it's 'engineer a verification step that doesn't share the generator's biases.'
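The majority-vote verification mentioned above can be sketched as follows. This is a minimal illustration, not the implementation from any of the cited notes: the function name `majority_vote_reward`, the `min_agreement` threshold, and the hard-coded sample answers are all hypothetical stand-ins for real solver samples.

```python
from collections import Counter

def majority_vote_reward(answers, min_agreement=0.5):
    """Return (pseudo_label, rewards) for a list of sampled answers.

    Hypothetical helper: the most common answer is treated as the
    pseudo-label, and each sample is rewarded 1.0 if it matches, else 0.0.
    If the top answer's share does not exceed `min_agreement`, the problem
    is treated as unverifiable and every sample gets zero reward.
    """
    if not answers:
        return None, []
    label, count = Counter(answers).most_common(1)[0]
    if count / len(answers) <= min_agreement:
        return None, [0.0] * len(answers)
    return label, [1.0 if a == label else 0.0 for a in answers]

# Assumed: five answers sampled from a solver for one proposed problem.
samples = ["42", "42", "41", "42", "40"]
label, rewards = majority_vote_reward(samples)
print(label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0, 0.0]
```

The vote only checks consistency across samples, so it cannot detect an error that most samples share.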

The takeaway: an LLM's confidence in its own idea is partly a measure of how *probable* that idea was to generate, not how *good* it is. That is why comparison, adversarial framing, and externalized verification consistently beat a model asked to grade its own work (Why do models trust their own generated answers?; What stops large language models from improving themselves?).

Sources 8 notes

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

SPICE: Self-Play In Corpus Environments Improves Reasoning (1.76 match, arXiv)
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future (1.73 match, arXiv)
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing (1.73 match, arXiv)
Word Meanings in Transformer Language Models (1.72 match, arXiv)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (1.72 match, arXiv)
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models (1.72 match, arXiv)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (1.71 match, arXiv)
Self-Rewarding Language Models (1.70 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question **Can language models accurately evaluate the quality of their own ideas?** remains open, and we need to test whether findings from 2023–2026 still hold.

**What a curated library found — and when (dated claims, not current truth):**
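Findings span 2023–2026 and cluster around three structural constraints (fluency bias, the generation–verification gap, unstable self-reports), one finding about reasoning traces, and one about what works:
- Models over-trust their own outputs due to fluency bias; high-probability answers *feel* more correct when the same model grades them, and this self-agreement loop breaks only when answers are compared against broader alternatives (2024).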
- A generation–verification gap prevents self-improvement: reliable correction requires validation *outside* the model; metacognition alone cannot close it (2024).
- Self-reports are unstable; models can describe learned behaviors without training but shift under pressure and lack genuine self-understanding (2025).
- Reasoning traces are persuasive performance, not verified computation; invalid logical steps score similarly to valid ones, so explanations for why an idea is good don't produce the answer (2026).
- Post-Completion Learning internalizes evaluation during training; asymmetric self-play and majority-vote verification beat single-model introspection (2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2403.09972 (2024) — Self-Detection for LLMs through Comprehensive analysis
- arXiv:2412.02674 (2024) — Mind the Gap: Self-Improvement Capabilities
- arXiv:2501.11120 (2025) — Tell me about yourself: LLMs are aware of their learned behaviors
- arXiv:2507.20252 (2025) — Post-Completion Learning for Language Models

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every structural limit above, judge whether newer model scales, instruction-tuning methods (DPO, IPO, variants), in-context exemplars, or deployment harnesses (chain-of-verification, multi-turn self-correction, memory-augmented rollback) have since *relaxed* or *overturned* the fluency bias, generation–verification gap, or instability in self-reports. Separate the durable question (does the model have transparent access to its own quality signal?) from the perishable limitation (can it *learn* to introspect more reliably?). Cite what relaxed it; say plainly where the constraint still holds.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for papers claiming self-evaluation *can* work reliably under certain conditions, or showing that newer training objectives (e.g., constitutional AI, process reward models, or online RL) have made single-model self-grading competitive with external verification.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., *If models can now internalize verification during training, can they also internalize contrastive self-evaluation, that is, rating their own ideas against synthetic adversarial foils?* and *Does stable self-knowledge (from 2025) correlate with accurate self-evaluation of idea quality?*

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
