Why does iterative refinement fail when information stays constant?
This explores why repeatedly revising an answer often stops helping — and connects that to a deeper claim: that when no new information enters the loop, iteration mostly recycles what's already there rather than improving it.
The corpus has a sharp answer: iteration without new information doesn't refine, it accumulates. Sequential revision methods reproduce the same failure as token-level 'overthinking': they pile up noise across passes with no guarantee that any pass is better than the last Do iterative refinement methods suffer from overthinking?. The fix there is telling: Progressive Draft Refinement wins not by revising more but by *compressing memory between iterations*, outperforming longer reasoning traces at matched compute. That is evidence that the problem isn't too few passes but carrying the same unfiltered content forward.
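The structural difference between carrying every draft forward and carrying a bounded memory can be sketched in a few lines. This is a toy illustration only, not the Progressive Draft Refinement algorithm: `revise` and `compress` are hypothetical stand-ins for a model call and a summarizer, and the sketch tracks only how much context each pass sees, not answer quality.

```python
from typing import List


def revise(context: str, pass_index: int) -> str:
    """Hypothetical stand-in for a model call that produces a new draft."""
    return f"draft {pass_index} written from {len(context)} chars of context"


def compress(draft: str, limit: int) -> str:
    """Hypothetical stand-in for a summarizer; here it simply truncates."""
    return draft[:limit]


def run_accumulating(question: str, passes: int) -> List[int]:
    """Carry every earlier draft forward; return the context size after each pass."""
    context = question
    sizes = []
    for i in range(passes):
        draft = revise(context, i)
        context = context + "\n" + draft  # nothing is ever filtered out
        sizes.append(len(context))
    return sizes


def run_compressed(question: str, passes: int, limit: int = 80) -> List[int]:
    """Carry only a bounded memory forward; return the context size seen by each pass."""
    memory = ""
    sizes = []
    for i in range(passes):
        context = question + "\n" + memory
        draft = revise(context, i)
        memory = compress(draft, limit)  # only the compressed state survives
        sizes.append(len(context))
    return sizes


if __name__ == "__main__":
    q = "Why does refinement stall?"
    print("accumulating:", run_accumulating(q, 5))
    print("compressed:  ", run_compressed(q, 5, limit=40))
```

In the accumulating loop the context grows on every pass, so later passes read everything earlier passes produced, useful or not. In the compressed loop the context never exceeds the question plus the memory limit, which is the property the paragraph above attributes to compressing memory between iterations.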
The deeper reason shows up when you ask whether models even *do* iteration in the way we imagine. One result finds that LLMs cannot actually execute iterative numerical procedures in latent space: they recognize a problem as template-similar to something memorized and emit plausible-but-wrong values, a failure that doesn't go away with scale Do large language models actually perform iterative optimization?. So when 'refinement' runs over constant information, there's often no genuine optimization step underneath; the loop is pattern-matching the same input and landing in the same neighborhood of answers. Relatedly, frontier reasoning models that look fluent at reflection (DeepSeek-R1 and o1-preview) hit a wall of 20-23.6% exact match on constraint-satisfaction problems that require real backtracking: reflective fluency doesn't convert into actual problem-solving on unfamiliar structure Can reasoning models actually sustain long-chain reflection?.
What distinguishes the methods that *do* improve under iteration is that they change what's carried between rounds rather than re-running on a frozen state. Context-as-playbook approaches treat the context as something that grows through generation-reflection-curation, deliberately preventing 'context collapse' where each rewrite erodes detail — structured incremental updates beat full rewrites precisely because they protect information instead of recompressing it away Can context playbooks prevent knowledge loss during iteration?. And in multi-turn search, agents degrade when each turn burns the context needed for the next; capping per-turn reasoning preserves room to actually *incorporate new evidence* across cycles Does limiting reasoning per turn improve multi-turn search quality?. The common thread: useful iteration is iteration that ingests or curates something new each round.
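The per-turn budget point is at bottom an arithmetic one, and a small sketch makes it concrete. The window size and token counts below are made-up illustrative numbers, and `turns_that_fit` is a hypothetical helper, not part of any search-agent framework; it assumes a fixed context window that every turn's evidence and reasoning must share.

```python
from typing import Optional


def turns_that_fit(
    window: int,
    evidence_per_turn: int,
    reasoning_per_turn: int,
    reasoning_cap: Optional[int] = None,
) -> int:
    """Count how many search turns fit in a fixed context window.

    Each turn spends tokens on retrieved evidence plus reasoning. If
    reasoning_cap is given, per-turn reasoning is clipped to that value.
    All quantities are illustrative assumptions, not measured values.
    """
    reasoning = reasoning_per_turn
    if reasoning_cap is not None:
        reasoning = min(reasoning, reasoning_cap)
    cost = evidence_per_turn + reasoning
    if cost <= 0:
        raise ValueError("each turn must consume at least one token")

    used = 0
    turns = 0
    while used + cost <= window:
        used += cost
        turns += 1
    return turns


if __name__ == "__main__":
    # Hypothetical numbers: 8000-token window, 500 tokens of evidence per turn.
    print("uncapped:", turns_that_fit(8000, 500, 3500))
    print("capped:  ", turns_that_fit(8000, 500, 3500, reasoning_cap=500))
```

With these assumed numbers, uncapped reasoning leaves room for only two rounds of evidence, while a 500-token cap leaves room for eight. The cap does not make any single turn smarter; it preserves the capacity to take in something new on later turns.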
There's a more radical framing worth knowing: when the information is fixed, sometimes the answer is to stop iterating *deeper* and iterate *wider*. Instead of grinding a single chain longer, sample parallel trajectories that explore different regions of the solution space at once — width recovers the gains depth-only refinement can't, without the variance blowup Can reasoning systems scale wider instead of only deeper?. This reframes the failure: a refinement loop over constant information is a narrow walk that keeps revisiting the same place. Breadth, not more revision, is what finds the better answer.
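A toy landscape shows the "narrow walk" picture in miniature. This sketch is an assumption-laden illustration, not GRAM or any published method: the `score` function is invented, `refine` is plain hill climbing standing in for a revision loop, and evenly spaced starting points stand in for stochastic parallel sampling so that the run is deterministic.

```python
from typing import Iterable


def score(x: int) -> float:
    """Invented objective: a small peak at x=20 and a taller one at x=70."""
    return max(5 - 0.5 * abs(x - 20), 10 - 0.5 * abs(x - 70), 0.0)


def refine(x: int, steps: int) -> int:
    """Depth: repeatedly move to a strictly better neighbor, if one exists."""
    for _ in range(steps):
        # Listing x first means ties keep the current position.
        best = max((x, x - 1, x + 1), key=score)
        if best == x:
            break  # local optimum or flat region: more passes change nothing
        x = best
    return x


def widen(starts: Iterable[int], steps: int) -> int:
    """Width: refine several independent starts and keep the best result."""
    return max((refine(s, steps) for s in starts), key=score)


if __name__ == "__main__":
    shallow = refine(15, steps=10)
    deep = refine(15, steps=1000)
    wide = widen([15, 35, 55, 75, 95], steps=20)
    print("depth, 10 steps:  ", shallow, score(shallow))
    print("depth, 1000 steps:", deep, score(deep))
    print("width, 5 starts:  ", wide, score(wide))
```

From the single start, ten steps and a thousand steps end at the same local peak, because every pass sees the same neighborhood. The wide run reaches the taller peak with only twenty steps per start, since some starts land in a region the single chain never visits.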
Finally, the limit case. Self-correction has a formal ceiling: any computable LLM must hallucinate on infinitely many inputs, and *internal* mechanisms like self-correction provably cannot eliminate it Can any computable LLM truly avoid hallucinating?. That's the mathematical version of the same intuition — a system reasoning in a closed loop over its own fixed outputs can't bootstrap its way to correctness. The escape isn't more refinement; it's new information from outside the loop, whether that's external evidence, curated memory, or empirical validation against the world (the move that lets self-improving systems like the Darwin Gödel Machine actually progress) Can AI systems improve themselves through trial and error?.
Sources (8 notes)
Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.