Why does uncontrolled self-revision drift toward instance-specific overfitting?
This explores why a model left to revise its own answers — with no outside check — tends to keep polishing for the case in front of it instead of getting genuinely better, drifting toward fixes that fit one instance rather than real improvement.
The corpus points to one root cause: a model has no independent yardstick for whether a revision is better, so it falls back on its own sense of correctness, and that sense is biased. Models systematically over-trust the answers they themselves produced, because high-probability generated text simply *feels* more correct when the same model evaluates it (Why do models trust their own generated answers?). When the generator and the judge share the same weights, revision becomes a closed loop that confirms rather than corrects.
The sharpest evidence that this loop drifts the wrong way comes from o1-style reasoning models (QwQ, R1, and LIMO): most self-revisions retain a wrong answer, smaller models frequently flip *correct* answers to incorrect, and longer chains with more revision steps correlate with *lower* accuracy (Does self-revision actually improve reasoning in language models?). Uncontrolled revision is not neutral; it actively erodes accuracy. Part of the mechanism is contamination: once a prior error sits in the context window, it biases everything downstream, and the degradation is non-linear rather than a gentle slope (Do models fail worse when their own errors fill the context?). Each revision pass feeds its own mistakes back as evidence, so the model overfits to a thread of reasoning it should have abandoned. Iterative refinement methods reproduce the same failure architecture at the response level, accumulating noise without any guarantee of improvement, which is why the fix in that work is to *compress* memory between iterations rather than let it pile up (Do iterative refinement methods suffer from overthinking?).
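To make the "revision isn't neutral" point concrete, here is a minimal toy model, not taken from the cited work. It assumes each revision pass independently fixes a wrong answer with probability `p_fix` and breaks a correct one with probability `p_break`; both names and the rates used below are hypothetical. It deliberately ignores context contamination, so real degradation could be steeper than this linear recurrence suggests.

```python
def accuracy_after_revisions(initial_accuracy, p_fix, p_break, steps):
    """Track expected accuracy across revision passes in a two-state toy model.

    Assumption (illustrative only): every pass flips wrong -> correct with
    probability p_fix and correct -> wrong with probability p_break,
    independently of earlier passes.
    """
    acc = initial_accuracy
    history = [acc]
    for _ in range(steps):
        # Correct answers survive with prob (1 - p_break);
        # wrong answers are repaired with prob p_fix.
        acc = acc * (1.0 - p_break) + (1.0 - acc) * p_fix
        history.append(acc)
    return history


def fixed_point(p_fix, p_break):
    """Accuracy the recurrence converges to, regardless of where it starts."""
    if p_fix + p_break == 0:
        raise ValueError("p_fix and p_break cannot both be zero")
    return p_fix / (p_fix + p_break)


if __name__ == "__main__":
    # Hypothetical rates: revisions break correct answers twice as often
    # as they fix wrong ones.
    for step, acc in enumerate(accuracy_after_revisions(0.7, 0.05, 0.10, 5)):
        print(step, round(acc, 4))
    print("limit:", round(fixed_point(0.05, 0.10), 4))
```

Under these assumed rates the long-run accuracy depends only on the ratio of fixing to breaking, not on how good the first answer was. More passes therefore pull a strong initial answer down toward that limit.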
There's a deeper, almost formal reason this can't be solved by just revising harder. Self-improvement is bounded by a generation–verification gap: a model can generate many candidate fixes but cannot reliably verify which is better without something external to validate it, and no amount of metacognition closes that gap (What stops large language models from improving themselves?). "Instance-specific overfitting" is what you get when verification collapses into the generator's own preferences — the model optimizes for what looks right on this example, using the very judgment that produced the error.
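The gap can be illustrated with a deliberately tiny sketch. It assumes a hypothetical `self_score` field standing in for the generator's own confidence, and a hypothetical external check (`sums_correctly`) that recomputes the answer independently. Neither is a real model API; the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    answer: int
    self_score: float  # hypothetical: the generator's own confidence


def pick_by_self_score(candidates):
    """Verification collapsed into the generator's own preference."""
    return max(candidates, key=lambda c: c.self_score)


def pick_by_external_check(candidates, check):
    """Return the first candidate that passes an independent check, else None."""
    passing = [c for c in candidates if check(c.answer)]
    return passing[0] if passing else None


def sums_correctly(answer, a=17, b=25):
    """Hypothetical external verifier: recompute the sum independently."""
    return answer == a + b


# Invented candidate fixes for "what is 17 + 25?".
CANDIDATES = [
    Candidate(answer=32, self_score=0.90),  # the original, wrong answer
    Candidate(answer=43, self_score=0.60),
    Candidate(answer=42, self_score=0.40),  # correct, but least "felt" confidence
]

if __name__ == "__main__":
    print("self-judged pick:", pick_by_self_score(CANDIDATES).answer)
    print("externally checked pick:",
          pick_by_external_check(CANDIDATES, sums_correctly).answer)
```

Generating the correct candidate was never the problem here; selecting it was. With only `self_score` available, the selector re-endorses the answer that produced the error.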
What's striking is how the *successful* self-improvement methods all smuggle in some external or structural anchor to break the loop. Training self-correction only works when it's done with online RL on the model's *own* error distribution — SFT on offline correction traces fails precisely because the model collapses into a single correction mode that doesn't match its real test-time mistakes (Why does self-correction training on offline data fail?). Other methods replace the missing yardstick with consistency signals rather than self-trust: SERL derives reward from ranking *consistency* across many judgments instead of a single self-vote (Can models learn to judge themselves without external rewards?), and asymmetric self-play uses majority-vote verification and a proposer/solver split so the two roles can't just agree with each other (Can language models improve themselves without any external training data?). The common thread: uncontrolled revision overfits because it has only one biased evaluator; controlled revision works by introducing a second, harder-to-fool signal.
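Two of the consistency signals mentioned above can be sketched in a few lines. This is an illustrative simplification, not the SERL or self-play implementation: `majority_vote`, `order_consistent`, and the `min_agreement` threshold are hypothetical names chosen for this example.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.5):
    """Return (winner, agreement) over independently sampled answers.

    The winner is None when agreement does not exceed min_agreement, so a
    split vote yields no training signal instead of a confident wrong one.
    """
    if not answers:
        raise ValueError("need at least one answer")
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return (top if agreement > min_agreement else None), agreement


def order_consistent(judge, a, b):
    """True if a pairwise judge picks the same winner in both presentation orders.

    `judge(x, y)` is assumed to return whichever of x or y it prefers.
    """
    return judge(a, b) == judge(b, a)


if __name__ == "__main__":
    print(majority_vote(["42", "42", "41", "42"]))

    # Hypothetical judge that always prefers whichever answer is shown first.
    def position_biased(x, y):
        return x

    print(order_consistent(position_biased, "answer A", "answer B"))
```

Neither check proves an answer is correct, and a majority can still be wrong. What they add is a signal that a single self-vote cannot supply: agreement across samples or across presentation orders is harder for one biased judgment to fake.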
The failure even has a reward-side twin worth knowing about. When training rewards are miscalibrated — say, problems that are too hard — models don't learn better reasoning; they latch onto degenerate shortcuts (answer repetition, skipping computation) that get reinforced as if they were genuine successes, contaminating capabilities they already had (Do overly hard RLVR samples actually harm model capabilities?). Whether the loop is self-revision or self-reward, the pattern is the same: without an outside check, a model optimizes toward whatever superficially satisfies its own judgment on the case at hand — which is exactly what instance-specific overfitting looks like.
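The reward-side mechanism is easy to see numerically. The sketch below assumes the common group-relative form, in which each reward is standardized against the mean and standard deviation of its sampling group. The function name, the use of population standard deviation, and the `eps` term are assumptions made for illustration rather than details from the cited work.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampling group (assumed GRPO-style form)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps keeps an all-identical group from dividing by zero.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # A too-hard problem: one accidental success among eight samples.
    rare = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 1])
    # A well-calibrated problem: half the samples succeed.
    balanced = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])
    print("rare success advantage:", round(rare[-1], 3))
    print("balanced success advantage:", round(balanced[0], 3))
```

For binary rewards with success rate p, a success's advantage works out to sqrt((1 - p) / p). The rarer the success, the larger the push toward whatever produced it, so a lucky shortcut is reinforced as strongly as sound reasoning.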
Sources (9 notes)
Why do models trust their own generated answers? LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Does self-revision actually improve reasoning in language models? Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.
Do models fail worse when their own errors fill the context? Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Do iterative refinement methods suffer from overthinking? Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.
What stops large language models from improving themselves? Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Why does self-correction training on offline data fail? SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Can models learn to judge themselves without external rewards? SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
Can language models improve themselves without any external training data? SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Do overly hard RLVR samples actually harm model capabilities? Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.