INQUIRING LINE

When an AI edits its own work with no outside check, it polishes the specific example in front of it instead of actually improving.

Why does uncontrolled self-revision drift toward instance-specific overfitting?

This line explores why a model left to revise its own answers, with no outside check, keeps polishing the case in front of it and drifts toward fixes that fit one instance rather than improving in general.

The corpus points to one root cause: a model has no independent yardstick for whether a revision is better, so it falls back on its own sense of correctness, and that sense is biased. Models systematically over-trust the answers they themselves produced, because high-probability generated text simply *feels* more correct when the same model evaluates it (Why do models trust their own generated answers?). When the generator and the judge are the same weights, revision becomes a closed loop that confirms rather than corrects.

The sharpest evidence that this loop drifts the wrong way comes from o1-style reasoning models (QwQ, R1, and LIMO): most self-revisions retain a wrong answer, smaller models frequently flip *correct* answers to incorrect, and longer chains with more revision steps correlate with *lower* accuracy (Does self-revision actually improve reasoning in language models?). Revision is not neutral: left uncontrolled, it actively erodes accuracy. Part of the mechanism is contamination. Once a prior error sits in the context window, it biases everything downstream, and the degradation is non-linear rather than a gentle slope (Do models fail worse when their own errors fill the context?). Each revision pass feeds its own mistakes back as evidence, so the model overfits to a thread of reasoning it should have abandoned. Iterative refinement methods reproduce the same failure architecture at the response level, accumulating noise without any guarantee of improvement, which is why the fix in that work is to *compress* memory between iterations rather than let it pile up (Do iterative refinement methods suffer from overthinking?).
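The idea of compressing memory between iterations can be sketched as a revision loop that carries forward only a bounded summary instead of the full history of drafts. This is a minimal illustration, not the Progressive Draft Refinement algorithm from the cited work: `refine_with_compressed_memory`, `draft_fn`, `summarize_fn`, and `max_summary_chars` are all hypothetical names, and the two callables stand in for model calls.

```python
from typing import Callable, List


def refine_with_compressed_memory(
    task: str,
    draft_fn: Callable[[str, str], str],       # hypothetical: (task, summary) -> new draft
    summarize_fn: Callable[[List[str]], str],  # hypothetical: texts -> short note
    rounds: int = 3,
    max_summary_chars: int = 400,
) -> str:
    """Revise for `rounds` passes while feeding back only a bounded summary."""
    summary = ""  # bounded carry-over; earlier drafts are never re-fed verbatim
    draft = draft_fn(task, summary)
    for _ in range(rounds - 1):
        # Compress the previous summary plus the latest draft into a short note,
        # then hard-cap its size so errors cannot pile up in the context.
        summary = summarize_fn([summary, draft])[:max_summary_chars]
        draft = draft_fn(task, summary)
    return draft
```

The hard character cap is the crude part of the sketch: it guarantees that the context seen by each pass stays the same size, but it says nothing about whether the summary keeps the right information. That selection step is where a real method would do its work.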

There's a deeper, almost formal reason this can't be solved by just revising harder. Self-improvement is bounded by a generation–verification gap: a model can generate many candidate fixes but cannot reliably verify which is better without something external to validate it, and no amount of metacognition closes that gap (What stops large language models from improving themselves?). "Instance-specific overfitting" is what you get when verification collapses into the generator's own preferences — the model optimizes for what looks right on this example, using the very judgment that produced the error.

What's striking is how the *successful* self-improvement methods all smuggle in some external or structural anchor to break the loop. Training self-correction only works when it's done with online RL on the model's *own* error distribution — SFT on offline correction traces fails precisely because the model collapses into a single correction mode that doesn't match its real test-time mistakes (Why does self-correction training on offline data fail?). Other methods replace the missing yardstick with consistency signals rather than self-trust: SERL derives reward from ranking *consistency* across many judgments instead of a single self-vote (Can models learn to judge themselves without external rewards?), and asymmetric self-play uses majority-vote verification and a proposer/solver split so the two roles can't just agree with each other (Can language models improve themselves without any external training data?). The common thread: uncontrolled revision overfits because it has only one biased evaluator; controlled revision works by introducing a second, harder-to-fool signal.
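Majority-vote verification, one of the anchors named above, can be sketched in a few lines. This is an illustrative version under stated assumptions, not the procedure from any of the cited papers: `majority_vote` and `min_agreement` are hypothetical names, answers are compared as lightly normalized strings, and the function abstains when no answer wins a strict majority.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_vote(
    answers: Sequence[str], min_agreement: float = 0.5
) -> Tuple[Optional[str], float]:
    """Return (winning answer, agreement share), or (None, share) on weak agreement."""
    if not answers:
        return None, 0.0
    # Normalize lightly so trivial formatting differences do not split the vote.
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    share = votes / len(answers)
    # Abstain unless the share strictly exceeds the threshold.
    if share <= min_agreement:
        return None, share
    return winner, share
```

Agreement among samples is a consistency signal, not a correctness guarantee: if the model is confidently wrong in the same way across samples, the vote will confirm the error. The abstain branch is there so that weak pluralities are not treated as verified answers.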

The failure even has a reward-side twin worth knowing about. When training rewards are miscalibrated — say, problems that are too hard — models don't learn better reasoning; they latch onto degenerate shortcuts (answer repetition, skipping computation) that get reinforced as if they were genuine successes, contaminating capabilities they already had (Do overly hard RLVR samples actually harm model capabilities?). Whether the loop is self-revision or self-reward, the pattern is the same: without an outside check, a model optimizes toward whatever superficially satisfies its own judgment on the case at hand — which is exactly what instance-specific overfitting looks like.
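A small worked calculation may make the reward-side failure concrete. The source note for this point attributes the reinforcement to group-relative normalization. The sketch below assumes the common form in which each rollout's advantage is its reward minus the group mean, divided by the group's population standard deviation. The source does not spell out the formula, and `group_relative_advantages` and `eps` are illustrative names rather than any library's API.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One accidental success among n rollouts on a nearly impossible problem.
for n in (4, 8, 16):
    rewards = [1.0] + [0.0] * (n - 1)
    print(n, round(group_relative_advantages(rewards)[0], 3))
```

Under these assumptions, a single success in a group of n rollouts receives an advantage of about the square root of n - 1, so the rarer the success, the harder it is reinforced. If that lone success came from answer repetition or skipped computation, the shortcut is exactly what gets the large update.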

Sources 9 notes

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Do iterative refinement methods suffer from overthinking?

Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

SPICE: Self-Play In Corpus Environments Improves Reasoning (3.42 match)
When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models (2.46 match)
Self-Questioning Language Models (1.77 match)
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future (1.76 match)
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (1.76 match)
Chain-of-thought Reasoning Is A Policy Improvement Operator (1.73 match)
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing (1.73 match)
Self-Rewarding Language Models (1.72 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher tasked with stress-testing a dated claim about self-revision pathology. The question remains open: Why does uncontrolled self-revision drift toward instance-specific overfitting?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable:
- Models systematically over-trust their own generated text; generator–judge collapse creates a closed loop that confirms rather than corrects (~2024).
- Uncontrolled revision actively erodes accuracy: in o1-like models, longer chains correlate with *lower* accuracy; smaller models flip correct answers to incorrect (~2025).
- Self-conditioning: prior errors in context window amplify future error non-linearly; each revision pass feeds mistakes back as evidence, causing instance-specific overfitting (~2024).
- Self-improvement is bounded by generation–verification gap; no amount of metacognition closes it without external validation (~2024).
- Successful self-improvement smuggles in structural anchors: online RL on own error distribution, consistency-ranking (SERL), majority-vote verification, or proposer/solver splits (~2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2403.09972 (2024-03): Self-detection failure via model bias.
- arXiv:2409.12917 (2024-09): Self-correction via RL.
- arXiv:2502.12215 (2025-02): o1-like test-time scaling limits.
- arXiv:2605.28388 (2026-05): Sample difficulty in RLVR pathology.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, methods, tooling (e.g., multi-agent orchestration, extended-context memory, novel evaluation harnesses), or training regimes have since RELAXED or OVERTURNED it. Separate the durable question from the perishable limitation. Does generator–judge collapse still hold at frontier scale? Has online RL on error distributions proven robust across domains, or do we now see novel failure modes? Cite what resolved constraints; plainly flag what still holds.
(2) SURFACE DISAGREEMENT: Identify the strongest contradicting or superseding work from the last ~6 months. Does any recent paper claim uncontrolled self-revision *can* avoid overfitting under stated conditions? If so, what breaks the verification gap assumption?
(3) PROPOSE 2 RESEARCH QUESTIONS that assume the regime may have shifted. Example: "Given post-hoc mechanistic interpretability, can we engineer a cheap, model-intrinsic 'confidence calibrator' that breaks generator–judge collapse without external data?" or "Do ensemble-in-context or hidden-state ranking methods (vs. token-level voting) avoid the contamination amplification effect?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
