What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

"Mind the Gap" (Song et al., 2025) formalizes the precondition for self-improvement: the generation-verification gap, defined as the difference between a model's ability to verify solutions versus its ability to generate them. When this gap is positive, self-improvement has room to operate — the model can evaluate outputs better than it can produce them, creating a usable training signal.
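
One way to make the definition above concrete is the sketch below. It is a simplification, not the paper's exact estimator: it assumes several sampled candidates per prompt, ground-truth correctness flags, and self-assigned verifier scores, and it measures the gap as the accuracy of the model's top-ranked candidate minus the average accuracy of everything it generated. The function names and the numbers in `example` are hypothetical.

```python
from statistics import mean


def generation_accuracy(correct):
    """Average correctness over all sampled candidates for one prompt."""
    return mean(1.0 if c else 0.0 for c in correct)


def verified_accuracy(correct, scores):
    """Correctness of the candidate the model's own verifier ranks highest."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return 1.0 if correct[best] else 0.0


def gv_gap(samples):
    """samples: one (correct_flags, verifier_scores) pair per prompt."""
    gen = mean(generation_accuracy(c) for c, _ in samples)
    ver = mean(verified_accuracy(c, s) for c, s in samples)
    return ver - gen


# Made-up data: two prompts, four candidates each.
example = [
    ([True, False, False, False], [0.9, 0.2, 0.1, 0.3]),  # verifier picks the correct one
    ([False, True, False, False], [0.8, 0.4, 0.1, 0.2]),  # verifier picks a wrong one
]
print(gv_gap(example))  # 0.25 for these made-up numbers
```

A positive value means that filtering by the model's own judgment selects better outputs than sampling alone, which is the usable training signal the note describes.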

The gap scales monotonically with pretraining FLOPs. Larger models have proportionally larger generation-verification gaps, which is consistent with self-improvement methods working better on larger models. On 4×4 Sudoku, a task family where solving is hard in general (generalized Sudoku is NP-complete) but checking a filled grid takes polynomial time, only the largest models (72B+) show non-trivial gaps, with relative accuracy improvements of 50-300%.
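
To illustrate why Sudoku sits on the "cheap to verify" side, here is a minimal checker for a filled grid. It is an illustrative sketch rather than anything from the paper; the function name and the `puzzle` convention (0 marks an empty clue cell) are assumptions.

```python
from math import isqrt


def is_valid_solution(grid, puzzle=None):
    """Check a filled n×n Sudoku grid with one pass over rows, columns, and boxes."""
    n = len(grid)
    box = isqrt(n)
    if box * box != n or any(len(row) != n for row in grid):
        return False
    target = set(range(1, n + 1))
    rows_ok = all(set(row) == target for row in grid)
    cols_ok = all({grid[r][c] for r in range(n)} == target for c in range(n))
    boxes_ok = all(
        {grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box)} == target
        for br in range(0, n, box)
        for bc in range(0, n, box)
    )
    # If the original puzzle is given, every clue must be preserved.
    clues_ok = puzzle is None or all(
        puzzle[r][c] in (0, grid[r][c]) for r in range(n) for c in range(n)
    )
    return rows_ok and cols_ok and boxes_ok and clues_ok


solved = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
print(is_valid_solution(solved))  # True
```

The check touches each cell a constant number of times, whereas producing the grid in the first place requires search. That asymmetry is what leaves room for a positive gap.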

However, the gap vanishes for factual recall tasks. On Natural Questions, the gap is <1% or negative across all model sizes — verification provides no additional signal because knowing the answer and verifying the answer require the same factual knowledge. This predicts which tasks will benefit from self-improvement and which won't: tasks where generation is computationally harder than verification (math, code, structured problems) benefit; tasks where both require the same knowledge (factual QA) don't.

The diversity collapse finding is equally important: during iterative self-improvement, pass@k increases for small k (quality improves at the top) but decreases for large k (diversity decreases overall). The model converges on solutions it can verify, which are typically common patterns. Rare but correct solutions get filtered out because the model can't verify them. This is the entropy collapse dynamic operating through the verification bottleneck rather than through the policy directly.
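
The pass@k pattern can be reproduced with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k) for c correct samples out of n. The per-problem counts below are invented purely to illustrate the shape of the effect and are not results from the paper.

```python
from math import comb
from statistics import mean


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct, given c correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples per problem."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)


n = 10
before = [3, 1]  # hypothetical: both problems are solved at least occasionally
after = [9, 0]   # hypothetical: the common solution dominates, the rare one is filtered out

for k in (1, 10):
    print(k, round(mean_pass_at_k(before, n, k), 2), round(mean_pass_at_k(after, n, k), 2))
```

With these made-up counts, pass@1 rises from 0.20 to 0.45 while pass@10 falls from 1.0 to 0.5: the top of the distribution sharpens while coverage of the second problem disappears.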

The non-overlap property of verification mechanisms (different verifiers catch different errors despite functional similarity) suggests that compositional verification, meaning the combination of multiple verification approaches, could substantially extend the ceiling. This is architecturally distinct from the temporal anchoring solution in "Why does self-rewarding training collapse when responses improve?": one fixes the preference signal, the other expands the verification surface.
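
A toy sketch of compositional verification, using a sorting task as a stand-in: two simple checkers each accept a different wrong output, and only their conjunction rejects both. The checker names and the task are assumptions chosen for illustration; the verifiers discussed in the note are model-based rather than programmatic.

```python
from collections import Counter


def is_sorted(inp, out):
    """Catches ordering errors, but not dropped or invented elements."""
    return all(a <= b for a, b in zip(out, out[1:]))


def same_elements(inp, out):
    """Catches dropped or invented elements, but not ordering errors."""
    return Counter(inp) == Counter(out)


def compose(*verifiers):
    """Accept a candidate only if every verifier accepts it."""
    def combined(inp, out):
        return all(v(inp, out) for v in verifiers)
    return combined


inp = [3, 1, 2]
candidates = [[1, 2, 3], [1, 2], [3, 1, 2]]
verify = compose(is_sorted, same_elements)
print([verify(inp, c) for c in candidates])  # [True, False, False]
```

Each checker alone lets one faulty candidate through; composing them shrinks the set of errors that escape verification, which is the sense in which the verification surface expands.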

Promptbreeder as a practical bound-pusher for prompt optimization: Promptbreeder (Fernando et al., 2023) pushes against these bounds in the specific setting of prompt optimization. It overcomes APE's "diminishing returns after three rounds" through a diversity-maintaining evolutionary algorithm in which mutation-prompts (instructions for modifying task-prompts) evolve alongside the task-prompts themselves, a form of self-referential self-improvement grounded in LLMs. Promptbreeder outperforms CoT and Plan-and-Solve on arithmetic and commonsense reasoning. The self-improvement is still bounded by the LLM's generation capability, however: mutation-prompts can only express modifications the model can articulate, and fitness evaluation depends on the model's own outputs. This makes Promptbreeder a concrete instantiation of the gap framework: the generation-verification gap determines the ceiling, and the evolutionary diversity mechanism delays the diversity collapse without eliminating it. Source: Prompts Prompting.
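
The co-evolution of task-prompts and mutation-prompts can be sketched as a binary-tournament step, shown below. This is a heavily simplified illustration, not Promptbreeder's implementation: `llm` and `fitness` are hypothetical callables supplied by the caller, the prompt templates are invented, and the single `hyper_rate` parameter stands in for the paper's richer set of mutation operators.

```python
import random
from dataclasses import dataclass


@dataclass
class Unit:
    task_prompt: str      # the prompt being optimized
    mutation_prompt: str  # the instruction used to rewrite task prompts


def step(population, llm, fitness, rng, hyper_rate=0.2):
    """One tournament: the loser is overwritten by a mutated copy of the winner."""
    i, j = rng.sample(range(len(population)), 2)
    if fitness(population[i].task_prompt) < fitness(population[j].task_prompt):
        i, j = j, i  # make i the winner, j the loser
    winner = population[i]
    mutation_prompt = winner.mutation_prompt
    if rng.random() < hyper_rate:
        # Self-referential part: the mutation instruction itself is rewritten.
        mutation_prompt = llm(
            "Improve this instruction for rewriting prompts: " + mutation_prompt
        )
    task_prompt = llm(mutation_prompt + "\n" + winner.task_prompt)
    population[j] = Unit(task_prompt, mutation_prompt)
```

Both bounds named above are visible in the sketch: every new prompt is something `llm` could articulate, and selection pressure comes entirely from `fitness`, which in practice depends on the model's own outputs.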

Empirical validation via evolutionary self-improvement (DGM): The Darwin Gödel Machine replaces formal self-improvement proofs with empirical validation: an evolutionary archive of past modifications, population-based search over code-level self-modifications, and fitness measured by benchmark performance. DGM improved its coding agent from 20.3% to 50.0% on SWE-bench Verified through iterative self-modification. This sidesteps the generation-verification gap by changing what "verification" means: instead of the model verifying its own outputs against a fixed standard, verification is empirical (does performance improve?) and historical (does the archive contain precedents?). The gap framework predicts this should work: empirical testing is a stronger verifier than self-evaluation, and evolutionary archives provide external reference points that prevent the diversity collapse that pure self-improvement suffers. See "Can AI systems improve themselves through trial and error?".
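
The archive-based loop can be sketched in a few lines. This is a toy under stated assumptions, not DGM's implementation: an "agent" is reduced to a single integer skill level, `mutate` and `evaluate` are hypothetical stand-ins for code-level self-modification and benchmark runs, and parents are sampled uniformly rather than with any weighting by performance.

```python
import random


def evolve(initial, mutate, evaluate, steps, seed=0):
    """Grow an archive of (agent, score) pairs; any archived agent can become a parent."""
    rng = random.Random(seed)
    archive = [(initial, evaluate(initial))]
    for _ in range(steps):
        parent, _ = rng.choice(archive)            # uniform sampling: a simplification
        child = mutate(parent, rng)
        archive.append((child, evaluate(child)))   # verification is purely empirical
    return archive


def best(archive):
    """Highest-scoring entry in the archive."""
    return max(archive, key=lambda entry: entry[1])


# Toy benchmark: an agent "solves" every task whose difficulty is at most its skill.
tasks = [1, 2, 3, 4, 5, 6, 7, 8]


def toy_evaluate(skill):
    return sum(d <= skill for d in tasks) / len(tasks)


def toy_mutate(skill, rng):
    return skill + rng.choice([-1, 1])


archive = evolve(2, toy_mutate, toy_evaluate, steps=50)
print(best(archive))
```

Because weaker variants stay in the archive instead of being discarded, the search keeps stepping stones that a greedy loop would lose, and the best score can never drop below that of the initial agent.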

The generator-discriminator-critique gap provides concrete evidence. Saunders et al. (2022) fine-tune large language models to write natural language critiques of model outputs. On topic-based summarization, model-written critiques help humans find flaws they would have missed. However, "we failed to find a clear trend showing critique performance catching up to discriminator performance, implying that larger models still have relevant knowledge they don't articulate as critiques." This is a direct instantiation of the generation-verification gap: the model can discriminate quality (verification) better than it can explain what's wrong (generation of critique). The gap persists at scale, suggesting it is structural rather than a matter of insufficient training. Source: Arxiv/Evaluations.

Inquiring lines that read this note 22

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can model confidence signals reliably improve reasoning quality and calibration?

Do models actually self-assess their confidence or just confirm answers?

Why does verification consistently lag behind AI generation?

Can single-axis benchmarks accurately predict agent deployment success?

How does benchmark performance measure translate to general self-modification ability?

How can AI agents autonomously learn and transfer skills across tasks?

What capabilities can emerge from self-modification that the original agent lacked?

How can AI systems learn from failures without cascading errors?

Can population diversity in self-improvement prevent error avalanching failures?

How does objective evolution guide discovery better than fixed planning?

Is model self-awareness based on genuine introspection or pattern matching?

What skills can large models identify and organize about their own abilities?

How do self-generated feedback mechanisms enable effective model learning?

Why does self-revision increase model confidence while degrading accuracy?

Can a model evaluate its own improvements without degrading over iterations?

When does optimizing for quality undermine the value of diversity?

How does diversity collapse during iterative self-improvement affect solution quality?

Why do self-improving systems struggle without clear external performance metrics?

Why do most self-improving systems fail when given tasks with no clear external benchmark?

Related concepts in this collection 7

Does a model improve by arguing with itself? When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
specific instance: single-model self-revision collapses when the generation-verification gap is narrow
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
diversity collapse during self-improvement mirrors entropy collapse during RL; the mechanism differs (verification filtering vs policy concentration) but the outcome is the same
How quickly do errors compound during model self-training? When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
a related but distinct iterative failure mode; error avalanching is about error accumulation, the gap framework is about verification ceilings
Why does self-rewarding training collapse when responses improve? Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?
gradient collapse is one consequence of a narrowing generation-verification gap
Can AI systems improve themselves through trial and error? Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.
empirical validation + evolutionary archives sidestep the formal gap by changing what verification means
Can LLMs understand concepts they cannot apply? Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
Potemkin understanding is a qualitative manifestation of a positive generation-verification gap: the model verifies/explains better than it generates/applies, and this disconnect is exactly what makes self-improvement possible on those tasks
Why do language models fail to act on their own reasoning? LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
the knowing-doing gap (87% correct rationales, 64% correct actions) quantifies the generation-verification gap in sequential decision-making: the model's verification ability (rationale generation) exceeds its generation ability (action selection)
