INQUIRING LINE

Can a model evaluate its own improvements without degrading over iterations?

This explores whether a model can act as its own judge — scoring and refining its own work across rounds — without quietly getting worse each time it does so.


The corpus gives a split answer: self-evaluation works when something keeps it honest, and degrades predictably when nothing does. The cleanest statement of the limit is the generation-verification gap: a model can only improve itself in domains where it judges solutions better than it produces them, and that margin shrinks toward zero on factual tasks (see "What limits how much models can improve themselves?" and "What stops large language models from improving themselves?"). So "evaluate its own improvements" isn't one capability: it depends entirely on whether verifying is easier than generating for the task at hand.
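One way to make the gap concrete is to compare how often a randomly drawn answer is right with how often the model's own top-rated answer is right. The sketch below is a toy illustration, not the formal definition used in the cited work: the `Candidate` type, the `self_score` field, and the two hand-built datasets are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    self_score: float  # hypothetical: score the model gives its own answer
    correct: bool      # ground truth, visible only to the evaluator


def generation_accuracy(tasks):
    """Chance that a randomly drawn candidate is correct, averaged over tasks."""
    per_task = [sum(c.correct for c in cands) / len(cands) for cands in tasks]
    return sum(per_task) / len(per_task)


def verified_accuracy(tasks):
    """Accuracy when the model keeps its own top-scored candidate per task."""
    picks = [max(cands, key=lambda c: c.self_score) for cands in tasks]
    return sum(p.correct for p in picks) / len(picks)


def gap(tasks):
    """Toy generation-verification gap: what self-selection adds over sampling."""
    return verified_accuracy(tasks) - generation_accuracy(tasks)


# Invented data: self-scores track correctness (reasoning-style task).
reasoning = [
    [Candidate(0.9, True), Candidate(0.2, False),
     Candidate(0.1, False), Candidate(0.3, False)],
    [Candidate(0.8, True), Candidate(0.7, True),
     Candidate(0.2, False), Candidate(0.1, False)],
]

# Invented data: self-scores carry no information (factual-style task).
factual = [
    [Candidate(0.9, True), Candidate(0.2, False)],
    [Candidate(0.8, False), Candidate(0.3, True)],
]

print(gap(reasoning))  # 0.625
print(gap(factual))    # 0.0
```

When the gap is positive, selecting by self-score yields better data than sampling alone, so there is something to learn from. When it is zero, the model's judgment adds nothing to its own generations.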

The degradation isn't hypothetical, and it has a specific shape. A model asked to revise based on its own prior reasoning tends to grow *more* confident in wrong answers, not less, a failure mode distinct enough to have its own name: degeneration of thought (see "Does a model improve by arguing with itself?"). The root cause is a structural bias: models over-trust answers they generated themselves, because a high-probability output simply *feels* correct when the same model grades it (see "Why do models trust their own generated answers?"). Stack iterations on top of that and the errors compound: prior mistakes sitting in the context history bias the next step, producing sharp non-linear decay over long-horizon tasks (see "Do models fail worse when their own errors fill the context?"). Iterative refinement can even reproduce "overthinking", with more rounds accumulating noise without guaranteed gains (see "Do iterative refinement methods suffer from overthinking?").

What breaks the spiral, consistently, is *diversity*: comparing against something other than yourself. Multi-agent debate between genuinely different models reverses the confidence-in-errors pattern and improves calibration (see "Does a model improve by arguing with itself?"). Self-detection improves the moment a model compares its answer against broader alternatives instead of agreeing with itself (see "Why do models trust their own generated answers?"). And the survey of reliable self-improvement methods makes the trick explicit: the ones that work all smuggle in an external anchor (a past model version, a third-party judge, user corrections, or tool feedback), even when they're marketed as "pure" self-improvement (see "Can models reliably improve themselves without external feedback?").

The surprise is how *much* a model can improve over iterations once you supply a verifier it can't fool. Transformers learning only from their own correct solutions (filtered for correctness, which is an external check) jump from 10-digit to 100-digit addition with exponential, non-saturating gains (see "Can transformers improve exponentially by learning from their own correct solutions?"). Asymmetric self-play has a proposer invent problems and a solver learn via majority-vote verification, both improving through RL with no human labels (see "Can language models improve themselves without any external training data?"). SERL alternates a model between answering and judging, deriving reward from ranking consistency (see "Can models learn to judge themselves without external rewards?"). The Darwin Gödel Machine keeps an evolutionary archive and validates variants by benchmarking rather than self-belief (see "Can AI systems improve themselves through trial and error?").
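The generate-then-filter step behind the addition result can be sketched in a few lines. This is a minimal illustration under loud assumptions: `noisy_adder` is a hypothetical stand-in for a model, the off-by-one error pattern is invented, and the retraining step that would consume the kept examples is left out entirely.

```python
import random


def noisy_adder(a, b, error_rate, rng):
    """Hypothetical stand-in for a model: returns a + b, sometimes off by one."""
    result = a + b
    if rng.random() < error_rate:
        result += rng.choice([-1, 1])
    return result


def external_check(a, b, answer):
    """Verifier the generator cannot influence: exact integer arithmetic."""
    return a + b == answer


def collect_verified(num_samples, num_digits, error_rate, seed=0):
    """Sample problems, let the 'model' answer, keep only verified answers."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        a = rng.randrange(10 ** num_digits)
        b = rng.randrange(10 ** num_digits)
        answer = noisy_adder(a, b, error_rate, rng)
        if external_check(a, b, answer):
            kept.append((a, b, answer))
    return kept


# Every surviving example is correct by construction, whatever the
# generator's confidence was; this set would feed the next training round.
training_set = collect_verified(num_samples=200, num_digits=10, error_rate=0.3)
print(len(training_set), "verified examples out of 200")
```

The point of the design is that the filter never asks the generator for its opinion. A wrong answer is dropped whether or not the model would have rated it highly.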

The through-line worth taking away: the thing that makes self-evaluation degrade isn't iteration itself, it's *self-agreement*, a model grading its own output with its own biases. Every method that iterates without collapsing has quietly replaced "do I think this is better?" with a check the model can't talk itself out of: a correctness filter, a vote, a different model, a benchmark. Notably, raw scale doesn't rescue you here: bigger models still self-condition on their errors, and only spending compute at test time to keep contaminated context from biasing the next step reliably helps (see "Do models fail worse when their own errors fill the context?").
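As a rough illustration of swapping "do I think this is better?" for a vote, the sketch below accepts an answer only when enough independent samples agree on it. The function name `majority_vote` and the `min_agreement` threshold are assumptions for the example, not taken from any of the cited methods, and agreement among samples is a proxy for correctness rather than a guarantee of it.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.5):
    """Return (winner, agreement); winner is None without a clear majority.

    `answers` holds final answers from independent samples. The threshold
    is a hypothetical knob: the winner must exceed this share of the votes.
    """
    if not answers:
        return None, 0.0
    winner, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement <= min_agreement:
        return None, agreement
    return winner, agreement


print(majority_vote(["42", "42", "41", "42"]))  # ('42', 0.75)
print(majority_vote(["1", "2", "3"])[0])        # None
```

Returning `None` when no answer clears the threshold lets a training loop skip the example instead of learning from a guess.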


Sources (11 notes)

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Do iterative refinement methods suffer from overthinking?

Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst examining whether large language models can evaluate and improve their own outputs across iterations without performance collapse. The question remains open: under what structural conditions does self-evaluation sustain improvement, and when does it degrade?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, tracking both failure modes and rescue mechanisms:

• Models exhibit a generation-verification gap: they judge solutions better than they produce them only when verification is genuinely cheaper than generation; this margin shrinks to zero on factual tasks, setting a hard limit on self-improvement (2024–2025).
• Single-model self-revision triggers degeneration of thought — a distinct failure mode where models grow more confident in wrong answers because they over-trust outputs they generated themselves (2024).
• Self-conditioning amplifies errors non-linearly: prior mistakes in context history bias subsequent steps, producing sharp performance decay over long-horizon tasks; scale alone does not rescue this (2025–2026).
• Every robust self-improvement method smuggles in an external anchor — a different model, a correctness filter, majority-vote verification, or a benchmark — breaking the self-agreement spiral (2024–2026).
• Transformers learning only from filtered-correct solutions (external verification) achieve exponential, non-saturating gains: 10-digit to 100-digit addition (2025).

Anchor papers (verify; mind their dates):
• arXiv:2403.09972 (Mar 2024): Think Twice Before Trusting — self-detection fails due to inherent bias toward self-generated answers.
• arXiv:2412.02674 (Dec 2024): Mind the Gap — formalizes generation-verification gap as binding constraint.
• arXiv:2502.01612 (Feb 2025): Self-Improving Transformers — demonstrates exponential gains under external verification.
• arXiv:2505.22954 (May 2025): Darwin Godel Machine — evolution + benchmarking replaces self-belief as validation.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer training paradigms (e.g., test-time scaling, mixture-of-experts, or constitutional AI variants), evaluation harnesses, or multi-agent orchestration (e.g., cascade architectures, persistent memory stores) have relaxed the generation-verification gap, reduced self-conditioning bias, or made external anchors unnecessary. Separate the durable question (likely: do models fundamentally over-trust their own outputs?) from perishable limitations (possibly: scale/training can eliminate degeneration of thought). Cite what resolved it, or state plainly where constraints still hold.
(2) Surface the strongest contradicting or superseding work from the last 6 months — especially papers claiming self-improvement works *without* external feedback, or showing calibration can overcome self-agreement bias.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., (a) Can a single model self-improve if it learns to *simulate* adversarial disagreement in its own reasoning? (b) Does test-time compute allocation to uncertainty estimation eliminate the need for external verification?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
