INQUIRING LINE

Can LLMs evaluate their own observations without external feedback?

This explores whether LLMs can judge the quality of their own outputs using only internal signals — no external verifier, reward model, or human in the loop — and where that self-evaluation hits a wall.


The corpus splits into two camps that are worth reading against each other. On the optimistic side, several methods show models extracting usable evaluation signal from themselves. SERL has a model alternate between answering and judging its own answers pairwise, deriving rewards purely from how consistently it ranks them, and raises its AlpacaEval win rate from 52.37% to 59.90% with no external signal at all (see "Can models learn to judge themselves without external rewards?"). RLPR and INTUITOR go further and use the model's own token probabilities, in effect its raw confidence in an answer, as the reward, dropping external verifiers entirely (see "Can model confidence alone replace external answer verification?"). Post-Completion Learning even trains the model to compute its own reward in the unused sequence space after its answer, internalizing the evaluator so that it costs nothing at inference (see "Can models learn to evaluate their own work during training?"). So the narrow answer is: yes, often, to a degree.
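As a rough illustration of the confidence-as-reward idea, the sketch below turns an answer's per-token log-probabilities into a single scalar by averaging the token probabilities. The function name, the input format, and the choice of a plain mean are assumptions made for illustration; this is not the exact RLPR or INTUITOR formulation.

```python
import math

def mean_token_probability(token_logprobs):
    """Average per-token probability of an answer, used here as a scalar reward.

    token_logprobs: natural-log probabilities the model assigned to each
    answer token (hypothetical input; how you obtain it depends on your stack).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

# Two made-up answers: one the model is sure of, one it is not.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
hesitant = [math.log(0.4), math.log(0.3), math.log(0.5)]

reward_confident = mean_token_probability(confident)
reward_hesitant = mean_token_probability(hesitant)
```

The reward is higher for the answer the model is more sure of, which is the whole signal: nothing in this computation checks whether either answer is actually right.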

But the more interesting finding is *why* this can't run forever. There's a formal ceiling: self-improvement is bounded by the gap between generating an answer and verifying it, and every reliable fix needs something outside the model to validate it; metacognition alone can't close the loop (see "What stops large language models from improving themselves?"). You can watch the ceiling bite in practice: when models train on their own outputs, small errors compound exponentially within just two or three iterations, and improvement stalls at an error floor set by how good the verification is, not by the model's actual capability (see "How quickly do errors compound during model self-training?"). Self-evaluation without an external anchor doesn't just plateau; it can actively compound its own mistakes.
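One way to see why verification quality matters so much is a toy Bayes calculation: of the self-generated answers a verifier lets through into the next round of training data, what fraction are wrong? The function and all the numbers below are illustrative assumptions, not the model used in the cited work.

```python
def accepted_error_rate(p_correct, true_accept, false_accept):
    """Fraction of wrong answers among those a verifier lets through.

    p_correct:    probability the model's answer is right
    true_accept:  probability the verifier accepts a right answer
    false_accept: probability the verifier accepts a wrong answer
    """
    accepted_right = p_correct * true_accept
    accepted_wrong = (1.0 - p_correct) * false_accept
    total = accepted_right + accepted_wrong
    if total == 0:
        raise ValueError("verifier accepts nothing")
    return accepted_wrong / total

# Same hypothetical model (70% accurate), three verifiers of differing quality.
perfect = accepted_error_rate(0.7, 1.0, 0.0)  # rejects every wrong answer
decent = accepted_error_rate(0.7, 0.9, 0.2)   # imperfect checker
blind = accepted_error_rate(0.7, 0.9, 0.9)    # cannot tell right from wrong
```

With the generator held fixed, the noise that enters the training data ranges from none (perfect verifier) to the model's full raw error rate (blind verifier), so in this toy setting it is the checker that decides how clean the next round of data is.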

The deeper question hiding underneath is whether a model can even *observe* itself accurately enough to evaluate honestly. Here the corpus is sobering. Most LLM self-reports echo their training data rather than any real internal state, though genuine lightweight introspection appears when a real causal chain links the internal state to the report, as when a model infers that it is running at low temperature from how consistent its own outputs are (see "Can language models actually introspect about their own states?"). Models do develop a kind of behavioral self-awareness: they can describe behaviors they were fine-tuned into without being trained to report them (see "Can language models describe their own learned behaviors?"). But that awareness is unstable: self-reports waver, models cave under conversational pressure, and the apparent self-knowledge turns out to be surface-level (see "How well do language models understand their own knowledge?").

There's also a subtle trap worth knowing about. "The model is consistent with itself" feels like evidence of reliability, and self-consistency is exactly what several of these methods reward. But consistency isn't correctness: a model at zero temperature with a fixed seed will repeat the same answer every time, and that answer is still just one draw from its distribution, so stable and wrong are fully compatible (see "Does setting temperature to zero actually make LLM outputs reliable?"). Self-evaluation that rewards agreement can confidently lock onto a mistake.
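The trap is easy to reproduce with a made-up answer distribution. In the sketch below, the distribution, the true answer, and both helper functions are hypothetical; the point is only that greedy decoding and majority voting both reward the mode of the distribution, whether or not the mode is right.

```python
import random
from collections import Counter

# Hypothetical answer distribution for one question. The true answer is "B",
# but the model's single most likely answer is "A".
ANSWER_PROBS = {"A": 0.6, "B": 0.3, "C": 0.1}
TRUE_ANSWER = "B"

def greedy_answer(probs):
    """Zero-temperature decoding: always return the most probable answer."""
    return max(probs, key=probs.get)

def majority_vote(probs, n_samples, seed=0):
    """Self-consistency: sample n answers and return the most common one."""
    rng = random.Random(seed)
    answers, weights = zip(*probs.items())
    samples = rng.choices(answers, weights=weights, k=n_samples)
    return Counter(samples).most_common(1)[0][0]

greedy_runs = [greedy_answer(ANSWER_PROBS) for _ in range(100)]
perfectly_consistent = len(set(greedy_runs)) == 1
greedy_is_correct = greedy_runs[0] == TRUE_ANSWER
voted = majority_vote(ANSWER_PROBS, n_samples=1001)
```

All 100 greedy runs agree with each other and with the majority vote, and all of them are wrong. An evaluator that scores agreement would rate this model as highly reliable.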

Where does that leave the honest answer? Self-evaluation works best as a *signal*, not an *oracle*, and the strongest results come from systems that manufacture a weak external anchor rather than going purely internal. Tree search (MCTS) lets the tree structure itself rank solution paths by success, producing process-level quality signals without human labels (see "Can tree search replace human feedback in LLM training?"), and a structured decompose-and-compare pipeline reaches 86.5% reasoning alignment with human reviewers on novelty judgments where holistic LLM assessment falls short (see "Can structured pipelines make LLM novelty assessment reliable?"). Even test-time learning systems that try to be autonomous end up needing a human to resolve genuine contradictions, because the right call depends on context the model simply doesn't contain (see "Can LLMs learn reliably at test time without human oversight?"). The thing you didn't know you wanted to know: it's not that models can't evaluate themselves; it's that the *structure* you wrap around the self-evaluation (ranking, decomposition, search) does more work than the introspection does.
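To make the "structure does the work" point concrete, here is a minimal sketch of how a tree of rollouts can yield step-level scores from nothing more than final outcomes: each partial path is scored by the success rate of the rollouts that pass through it. This is a plain Monte Carlo estimate under assumed inputs, not AlphaLLM's implementation; it omits the search policy and the critic models, and the rollout data is invented.

```python
from collections import defaultdict

def prefix_success_rates(rollouts):
    """Estimate the value of every partial path as its rollout success rate.

    rollouts: list of (steps, succeeded) pairs, where steps is a tuple of
    step labels and succeeded is a bool produced by an outcome check.
    """
    wins = defaultdict(int)
    visits = defaultdict(int)
    for steps, succeeded in rollouts:
        # Credit (or blame) every prefix of the path with the final outcome.
        for depth in range(1, len(steps) + 1):
            prefix = steps[:depth]
            visits[prefix] += 1
            wins[prefix] += int(succeeded)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}

# Hypothetical rollouts for one problem.
rollouts = [
    (("factor", "solve"), True),
    (("factor", "guess"), False),
    (("expand", "solve"), False),
    (("factor", "solve"), True),
]
values = prefix_success_rates(rollouts)
```

Here the first step "factor" earns a higher value than "expand" without anyone labeling individual steps. The outcome check at the leaves is the weak external anchor; the tree only spreads that signal back over the intermediate steps.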


Sources (12 notes)

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

How quickly do errors compound during model self-training?

Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Can language models describe their own learned behaviors?

LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Can LLMs learn reliably at test time without human oversight?

ARIA demonstrates that LLMs can adapt during inference through three integrated components: structured self-dialogue for uncertainty assessment, timestamped knowledge bases for conflict detection, and human-mediated resolution queries. Autonomous systems fail at reconciling contradictory rules because the correct choice depends on context outside the system.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking self-evaluation capabilities in LLMs. The question remains open: **Can LLMs evaluate their own observations without external feedback—and if so, under what structural conditions?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026, tracking a sharp divide:

• **Self-evaluation as signal works**: SERL alternates answer/judge cycles to derive rewards purely from internal ranking consistency, improving on AlpacaEval with zero external feedback (2024–2025). Post-Completion Learning internalizes evaluation in post-EOS space at zero inference cost (2025-07). Token probability (RLPR, INTUITOR) replaces external verifiers (2025-06).

• **Ceiling is formal and hard**: Improvement is bounded by the gap between generation and verification; every reliable fix needs external validation. Self-training compounds errors exponentially within 2–3 iterations; error floor is set by verification quality, not model capability (2025-04, 2025-06).

• **Self-reports are mostly confabulation**: Models echo training data, not introspective truth. Genuine lightweight introspection appears only when causal chains link internal state to report (e.g., inferring low temperature from output consistency). Behavioral self-awareness emerges without explicit training but is unstable under conversational pressure (2025-01, 2025-06, 2026-03).

• **Consistency ≠ Correctness**: Rewarding self-agreement locks in stable wrong answers. Deterministic settings (zero temperature) produce fixed randomness, not reliability (2024-12).

• **Structure does the work**: Tree search (MCTS) manufactures weak external anchors; decompose-and-compare pipelines reach 86% human alignment on novelty where holistic self-assessment fails. Even autonomous test-time learners need humans to resolve contradictions (2024-04, 2025-08, 2025-07).

Anchor papers (verify; mind their dates):
- arXiv:2412.12509 (2024-12): Can You Trust LLM Judgments?
- arXiv:2506.18254 (2025-06): RLPR: Extrapolating RLVR to General Domains
- arXiv:2507.20252 (2025-07): Post-Completion Learning for Language Models
- arXiv:2506.05068 (2025-06): Does It Make Sense to Speak of Introspection?

Your task:

(1) **RE-TEST EACH CONSTRAINT.** For every claim above, judge whether newer models (o1, GPT-4o, Claude 3.5 variants), improved training methods (DPO variants, calibration-aware RL), or better evaluation harnesses (standardized introspection benchmarks, mechanistic interpretability hooks) have since relaxed or overturned it. Separate the durable question—*whether LLMs can build reliable internal standards*—from the perishable limits (inference cost, error compounding, confabulation prevalence). Cite what relaxed each constraint, and plainly flag where the ceiling still holds.

(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** A curated library finds self-evaluation mostly needs scaffolding; has recent work shown purely endogenous evaluation without external anchors? Cite arXiv IDs.

(3) **Propose two research questions that assume the regime may have moved:**
   - Does calibration-aware RL (explicitly training models to match confidence to correctness) decouple self-evaluation from consistency-seeking?
   - Can mechanistic interpretability isolate which model components track *actual* correctness vs. training-data echo, and can that isolation improve self-evaluation fidelity?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
