What limits how much models can improve themselves?
Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
"Mind the Gap" (Song et al., 2025) formalizes the precondition for self-improvement: the generation-verification gap, defined as the difference between a model's ability to verify solutions versus its ability to generate them. When this gap is positive, self-improvement has room to operate — the model can evaluate outputs better than it can produce them, creating a usable training signal.
The gap scales monotonically with pretraining FLOPs. Larger models have larger generation-verification gaps, which helps explain why self-improvement methods work better on larger models. For 4×4 Sudoku, a task family where generation is hard but verification is cheap (generalized Sudoku is NP-complete to solve, while a filled grid can be checked in polynomial time), only the largest models (72B parameters and up) show non-trivial gaps, with accuracy improvements of 50-300%.
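The asymmetry between verifying and generating can be made concrete for 4×4 Sudoku. The sketch below is an illustration only: the verifier makes a single pass over rows, columns, and boxes, while the generator has to search. The puzzle and the function names are made up for this example and are not taken from the paper.

```python
from itertools import product


def is_valid_solution(grid):
    """Verify a filled 4x4 grid: every row, column, and 2x2 box is {1,2,3,4}."""
    want = {1, 2, 3, 4}
    rows = [set(r) for r in grid]
    cols = [set(c) for c in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr, dc in product(range(2), repeat=2)}
        for r, c in product((0, 2), repeat=2)
    ]
    return all(group == want for group in rows + cols + boxes)


def _fits(grid, r, c, v):
    """True if placing v at (r, c) breaks no row, column, or box rule."""
    if v in grid[r] or any(grid[i][c] == v for i in range(4)):
        return False
    br, bc = 2 * (r // 2), 2 * (c // 2)
    return all(grid[br + i][bc + j] != v for i in range(2) for j in range(2))


def solve(grid):
    """Generate a solution by backtracking; 0 marks an empty cell."""
    for r, c in product(range(4), repeat=2):
        if grid[r][c] == 0:
            for v in range(1, 5):
                if _fits(grid, r, c, v):
                    grid[r][c] = v
                    if solve(grid):
                        return grid
                    grid[r][c] = 0
            return None  # no value fits this cell: backtrack
    return grid  # no empty cells left


puzzle = [
    [1, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 4, 0, 0],
    [0, 0, 0, 2],
]
solved = solve([row[:] for row in puzzle])
```

Checking `solved` with `is_valid_solution` touches each cell a fixed number of times, whereas `solve` may explore and discard many partial assignments before it succeeds.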
However, the gap vanishes for factual recall tasks. On Natural Questions, the gap is below 1% or negative across all model sizes: verification provides no additional signal because knowing the answer and verifying the answer require the same factual knowledge. The framework therefore predicts which tasks will benefit from self-improvement and which won't. Tasks where generation is computationally harder than verification (math, code, structured problems) benefit; tasks where both require the same knowledge (factual QA) don't.
The diversity collapse finding is equally important: during iterative self-improvement, pass@k increases for small k (quality improves at the top) but decreases for large k (diversity decreases overall). The model converges on solutions it can verify, which are typically common patterns. Rare but correct solutions get filtered out because the model can't verify them. This is the entropy collapse dynamic operating through the verification bottleneck rather than through the policy directly.
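The pass@k pattern described above can be reproduced with the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are invented purely to illustrate the shape of the effect; they are not results from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


n = 10
# Hypothetical counts of correct samples on four problems.
before = [2, 2, 2, 2]    # diverse model: occasionally right everywhere
after = [10, 10, 0, 0]   # collapsed model: always right on two, never on two

p1_before = mean_pass_at_k(before, n, 1)
p1_after = mean_pass_at_k(after, n, 1)
p5_before = mean_pass_at_k(before, n, 5)
p5_after = mean_pass_at_k(after, n, 5)
```

With these counts, pass@1 rises from 0.2 to 0.5 while pass@5 falls from roughly 0.78 to 0.5. Single-sample quality improves even as the set of problems the model can ever solve shrinks.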
The non-overlap property of verification mechanisms (different verifiers catch different errors despite being functionally similar) suggests that compositional verification, which combines multiple verification approaches, could substantially extend the ceiling. This is architecturally distinct from the temporal anchoring solution in "Why does self-rewarding training collapse when responses improve?": one fixes the preference signal, the other expands the verification surface.
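A toy version of compositional verification, assuming a sorting task chosen only for illustration: one verifier checks ordering, another checks that no elements were lost or invented, and each misses errors the other catches. The verifier names and candidate outputs are hypothetical.

```python
from collections import Counter


def check_order(inp, out):
    """Verifier A: the output is in non-decreasing order."""
    return all(a <= b for a, b in zip(out, out[1:]))


def check_permutation(inp, out):
    """Verifier B: the output contains exactly the input's elements."""
    return Counter(inp) == Counter(out)


def compose(*verifiers):
    """Build a verifier that accepts only if every component accepts."""
    def combined(inp, out):
        return all(v(inp, out) for v in verifiers)
    return combined


inp = [3, 1, 2]
candidates = {
    "correct": [1, 2, 3],
    "unsorted": [1, 3, 2],
    "dropped_element": [1, 2],
    "invented_element": [1, 2, 4],
}

verify_all = compose(check_order, check_permutation)
passes_order = {n for n, out in candidates.items() if check_order(inp, out)}
passes_perm = {n for n, out in candidates.items() if check_permutation(inp, out)}
accepted = {n for n, out in candidates.items() if verify_all(inp, out)}
```

Each verifier alone accepts at least one wrong candidate, and only the composition narrows acceptance to the correct output. The combined check covers a larger share of the error space than either component.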
Promptbreeder as a practical bound-pusher for prompt optimization: Promptbreeder (Fernando et al., 2023) shows one way to push against these bounds in prompt optimization. It overcomes APE's "diminishing returns after three rounds" with a diversity-maintaining evolutionary algorithm in which mutation-prompts (instructions for modifying task-prompts) evolve alongside the task-prompts themselves, giving self-referential self-improvement grounded in LLMs. Promptbreeder outperforms CoT and Plan-and-Solve on arithmetic and commonsense reasoning. The self-improvement is still bounded by the LLM's generation capability, however: mutation-prompts can only express modifications the model can articulate, and fitness evaluation depends on the model's own outputs. This makes Promptbreeder a concrete instantiation of the gap framework: the generation-verification gap determines the ceiling, and the evolutionary diversity mechanism delays the diversity collapse without eliminating it. Source: Prompts Prompting.
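The co-evolution of task-prompts and mutation-prompts can be sketched as a small evolutionary loop. Everything here is a toy stand-in: the LLM calls are replaced by deterministic stubs, the fitness function is an invented proxy for benchmark accuracy, and the binary-tournament selection and the 0.2 hyper-mutation rate are assumptions of this sketch, not details given in the note.

```python
import random

# Hypothetical instruction fragments a task-prompt can be built from.
FRAGMENTS = [
    "think step by step",
    "check your arithmetic",
    "restate the question",
    "answer briefly",
]
# Stand-in for benchmark accuracy: pretend these fragments help (assumption).
USEFUL = {"think step by step", "check your arithmetic"}


def fitness(task_prompt):
    """Toy fitness: number of useful fragments present in the task-prompt."""
    return sum(frag in task_prompt for frag in USEFUL)


def stub_mutate(mutation_prompt, task_prompt, rng):
    """Stub for an LLM call applying a mutation-prompt to a task-prompt."""
    if mutation_prompt == "append":
        return task_prompt + [rng.choice(FRAGMENTS)]
    edited = list(task_prompt)  # "replace": overwrite one fragment
    edited[rng.randrange(len(edited))] = rng.choice(FRAGMENTS)
    return edited


def stub_hypermutate(rng):
    """Stub for an LLM call that rewrites the mutation-prompt itself."""
    return rng.choice(["append", "replace"])


def evolve(population, steps, rng):
    """Binary tournament: the loser is overwritten by a mutated winner."""
    population = list(population)
    for _ in range(steps):
        i, j = rng.sample(range(len(population)), 2)
        if fitness(population[i][0]) < fitness(population[j][0]):
            i, j = j, i  # make i the winner
        task_prompt, mutation_prompt = population[i]
        if rng.random() < 0.2:  # occasionally evolve the mutation-prompt too
            mutation_prompt = stub_hypermutate(rng)
        child = stub_mutate(mutation_prompt, task_prompt, rng)
        population[j] = (child, mutation_prompt)
    return population


start = [
    (["answer briefly"], "replace"),
    (["restate the question"], "append"),
    (["answer briefly"], "append"),
    (["restate the question"], "replace"),
]
final = evolve(start, steps=200, rng=random.Random(0))
best_before = max(fitness(task) for task, _ in start)
best_after = max(fitness(task) for task, _ in final)
```

Because the tournament winner is never overwritten, the best fitness in the population cannot decrease. The ceiling is still set by what `stub_mutate` can produce and what `fitness` can recognize, which mirrors the bound described above.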
Empirical validation via evolutionary self-improvement (DGM): The Darwin Gödel Machine replaces formal proofs of self-improvement with empirical validation. It keeps an evolutionary archive of past modifications, runs a population-based search over code-level self-modifications, and measures fitness by benchmark performance. Through iterative self-modification, DGM improved its coding agent from 20.0% to 50.0% on SWE-bench Verified. This sidesteps the generation-verification gap by changing what "verification" means: instead of the model verifying its own outputs against a fixed standard, verification is empirical (does performance improve?) and historical (does the archive contain precedents?). The gap framework predicts this should work: empirical testing is a stronger verifier than self-evaluation, and evolutionary archives provide external reference points that prevent the diversity collapse that pure self-improvement suffers. See "Can AI systems improve themselves through trial and error?"
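The archive-plus-benchmark loop described above can be sketched in a few lines. This is a toy model under stated assumptions, not DGM's implementation: an "agent" is just a set of tool names, `self_modify` stands in for a code-level self-modification, the benchmark is invented, and the parent-sampling weights are chosen arbitrarily.

```python
import random

TOOL_POOL = ["editor", "test_runner", "search", "linter"]
# Hypothetical benchmark: each task is solved iff the agent has the named tool.
BENCHMARK = ["editor", "test_runner", "search", "editor"]


def evaluate(agent):
    """Empirical fitness: fraction of benchmark tasks the agent solves."""
    return sum(tool in agent for tool in BENCHMARK) / len(BENCHMARK)


def self_modify(agent, rng):
    """Stub for a code-level self-modification: gain one tool from the pool."""
    return agent | {rng.choice(TOOL_POOL)}


def archive_search(initial_agent, steps, rng):
    """Keep every evaluated variant in the archive as a possible parent."""
    archive = [(initial_agent, evaluate(initial_agent))]
    for _ in range(steps):
        # Favor higher-scoring parents but leave weaker ones selectable,
        # so earlier stepping stones stay reachable (weights are assumed).
        weights = [score + 0.1 for _, score in archive]
        parent, _parent_score = rng.choices(archive, weights=weights, k=1)[0]
        child = self_modify(parent, rng)
        archive.append((child, evaluate(child)))
    return archive


archive = archive_search(frozenset(), steps=30, rng=random.Random(0))
best_agent, best_score = max(archive, key=lambda entry: entry[1])
```

Verification here is external to the agent: `evaluate` measures benchmark performance directly instead of asking the agent to judge itself. The archive never discards a variant, so lower-scoring lineages remain available as parents.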
The generator-discriminator-critique gap provides concrete evidence that the gap persists at scale. Saunders et al. (2022) fine-tune large language models to write natural language critiques of model outputs. On topic-based summarization, model-written critiques help humans find flaws they would have missed. However, "we failed to find a clear trend showing critique performance catching up to discriminator performance, implying that larger models still have relevant knowledge they don't articulate as critiques." This is a direct instantiation of the generation-verification gap: the model can discriminate quality (verification) better than it can explain what's wrong (generation of critique). Because the gap does not close as models grow, it appears to be structural and not a matter of insufficient training. Source: Arxiv/Evaluations.
Inquiring lines that use this note as a source 19
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Do models actually self-assess their confidence or just confirm answers?
- How does the generation-verification gap limit AI self-improvement capabilities?
- How does benchmark performance measure translate to general self-modification ability?
- What capabilities can emerge from self-modification that the original agent lacked?
- Can population diversity in self-improvement prevent error avalanching failures?
- Can co-evolved critics truly circumvent static evaluator limitations in self-improvement?
- What skills can large models identify and organize about their own abilities?
- Why does optimizing only quality cause model collapse in self-improvement loops?
- How should training incorporate external critique versus encouraging self-correction?
- Can capability boundary collapse be reversed through external data?
- Can multiple verification approaches together overcome the self-improvement ceiling?
- Does the generation-verification gap actually limit self-improvement in verifiable tasks?
- Can a model evaluate its own improvements without degrading over iterations?
- How does diversity collapse during iterative self-improvement affect solution quality?
- What separates bootstrapping gains from sustained self-improvement gains?
- How does domain shift expose failures in fixed self-improvement mechanisms?
- How much can externalized skills improve models before hitting diminishing returns?
- Does the generation-verification gap limit how far AI can improve itself?
- Does the generation-verification gap define where self-rewarding actually works?
Related concepts in this collection 7
- Does a model improve by arguing with itself?
  When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
  specific instance: single-model self-revision collapses when the generation-verification gap is narrow
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  diversity collapse during self-improvement mirrors entropy collapse during RL; the mechanism differs (verification filtering vs policy concentration) but the outcome is the same
- How quickly do errors compound during model self-training?
  When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
  a related but distinct iterative failure mode; error avalanching is about error accumulation, the gap framework is about verification ceilings
- Why does self-rewarding training collapse when responses improve?
  Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?
  gradient collapse is one consequence of a narrowing generation-verification gap
- Can AI systems improve themselves through trial and error?
  Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.
  empirical validation + evolutionary archives sidestep the formal gap by changing what verification means
- Can LLMs understand concepts they cannot apply?
  Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
  Potemkin understanding is a qualitative manifestation of a positive generation-verification gap: the model verifies/explains better than it generates/applies, and this disconnect is exactly what makes self-improvement possible on those tasks
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  the knowing-doing gap (87% correct rationales, 64% correct actions) quantifies the generation-verification gap in sequential decision-making: the model's verification ability (rationale generation) exceeds its generation ability (action selection)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
- Hyperagents
- Self-Improving Model Steering
- Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
- Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Original note title
self-improvement is bounded by the generation-verification gap — a formal quantity that scales with pretraining compute and vanishes for factual tasks