INQUIRING LINE

How does error avalanching compound failures in self-training iterations?

This explores why a model trained on its own outputs can spiral downward fast — and the corpus suggests the real culprit isn't the errors themselves but the weakness of whatever filters them.


When a model learns from data it generated itself, errors compound quickly. Small inaccuracies in self-generated training data do not decay; they amplify rapidly across iterations, and improvement stalls or collapses within two or three rounds (How quickly do errors compound during model self-training?). What is striking is where the ceiling comes from: it is not the model's actual capability that stalls improvement but the quality of the verification step. Weak filtering sets an error floor, and the loop converges there regardless of how capable the underlying model is.
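A toy simulation can make the filter-sets-the-floor idea concrete. This is a minimal sketch and not a model taken from the cited notes: the update rule and the parameters `catch_rate`, `fresh_error`, and `amplify` are assumptions chosen only for illustration.

```python
from typing import List


def contamination(error_rate: float, catch_rate: float) -> float:
    """Share of wrong samples in the data that survives the filter.

    Assumption: the filter never rejects a correct sample and rejects
    a wrong one with probability `catch_rate`.
    """
    wrong_kept = error_rate * (1.0 - catch_rate)
    right_kept = 1.0 - error_rate
    kept = wrong_kept + right_kept
    return wrong_kept / kept if kept > 0 else 0.0


def simulate(catch_rate: float, rounds: int = 6, start_error: float = 0.05,
             fresh_error: float = 0.02, amplify: float = 2.0) -> List[float]:
    """Error rate per round under a hypothetical update rule.

    Assumption: a model retrained on data with contamination c makes
    errors at rate fresh_error + amplify * c, capped at 1.0.
    """
    errors = [start_error]
    for _ in range(rounds):
        c = contamination(errors[-1], catch_rate)
        errors.append(min(1.0, fresh_error + amplify * c))
    return errors


if __name__ == "__main__":
    for catch_rate in (0.0, 0.95):
        trajectory = simulate(catch_rate)
        print(catch_rate, [round(e, 3) for e in trajectory])
```

With these made-up parameters, the unfiltered loop saturates at an error rate of 1.0 by the fourth round, while the loop with a strong filter settles near a low plateau. In this toy, moving the plateau means changing `catch_rate`; the starting error rate has little influence on where the loop ends up.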

That reframes avalanching as a verification problem, not a capability problem, and the corpus has a deeper structural claim underneath it. Self-improvement is formally bounded by the gap between how well a model can *generate* answers and how well it can *check* them; every reliable fix requires something external to validate and enforce it, and no amount of metacognition lets a model think its way past this limit (What stops large language models from improving themselves?). So the avalanche is what you get when the generator outruns the verifier and nothing outside the loop catches the drift. The failure compounds because models are biased toward trusting their own outputs in the first place: a high-probability answer the model produced *feels* more correct when the same model evaluates it, which is exactly the wrong instinct for a self-training filter (Why do models trust their own generated answers?).

There is a second avalanche mechanism that is easy to conflate with the training-data one but is distinct: errors don't just poison the next training round, they poison the current context. When a model's own mistakes accumulate in its context history, performance degrades non-linearly on long-horizon tasks, and the model effectively conditions on its past errors to make worse future ones (Do models fail worse when their own errors fill the context?). Scaling the model doesn't fix this; only spending test-time compute (thinking) reduces it, by keeping the contaminated context from biasing the reasoning. So 'compounding' shows up at two timescales: across training iterations and within a single trajectory.
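The within-trajectory mechanism can be sketched with a small Monte Carlo toy. It assumes, purely for illustration, that each earlier error sitting in the context raises the chance of the next error by a fixed amount; `base_error` and `context_bias` are hypothetical parameters, not measured values.

```python
import random


def mean_step_accuracy(horizon: int, base_error: float = 0.02,
                       context_bias: float = 0.05, trials: int = 2000,
                       seed: int = 0) -> float:
    """Average per-step accuracy over many simulated trajectories.

    Assumption: the error probability at each step is
    base_error + context_bias * (errors already in the context),
    capped at 1.0. Setting context_bias to 0 gives independent steps.
    """
    rng = random.Random(seed)
    correct_steps = 0
    for _ in range(trials):
        errors_in_context = 0
        for _ in range(horizon):
            p_error = min(1.0, base_error + context_bias * errors_in_context)
            if rng.random() < p_error:
                errors_in_context += 1
            else:
                correct_steps += 1
    return correct_steps / (trials * horizon)


if __name__ == "__main__":
    for horizon in (20, 50, 100):
        independent = mean_step_accuracy(horizon, context_bias=0.0)
        conditioned = mean_step_accuracy(horizon)
        print(horizon, round(independent, 3), round(conditioned, 3))
```

In this toy, independent steps keep roughly the same accuracy at every horizon, while self-conditioned steps lose accuracy faster as the horizon grows. That is one way to picture non-linear degradation, though the real effect need not follow this particular rule.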

The flip side is the most useful thing here: the same loop runs in reverse when verification is clean. Transformers trained only on their *correct* self-generated solutions don't avalanche; they generalize exponentially, going from 10-digit to 100-digit addition without saturating (Can transformers improve exponentially by learning from their own correct solutions?). The single variable that separates this from collapse is the filter: keep only verified-correct outputs and improvement compounds upward; let unverified errors through and it compounds downward. Asymmetric self-play makes the same bet, using majority-vote verification and a proposer that calibrates problem difficulty so the solver always trains on a signal it can actually check (Can language models improve themselves without any external training data?).
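The two filters mentioned here, an exact correctness check and majority-vote agreement, can be sketched in a few lines. The function names, the `(a, b, predicted_sum)` sample layout, and the strict-majority threshold are assumptions for illustration and are not taken from the cited work.

```python
from collections import Counter
from typing import Callable, Hashable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
AdditionSample = Tuple[int, int, int]  # (a, b, predicted_sum), hypothetical layout


def keep_verified(samples: Sequence[T], is_correct: Callable[[T], bool]) -> List[T]:
    """Keep only samples that pass the verifier; the rest never reach training."""
    return [s for s in samples if is_correct(s)]


def addition_is_correct(sample: AdditionSample) -> bool:
    """Exact check, available only when ground truth is cheap to compute."""
    a, b, predicted_sum = sample
    return a + b == predicted_sum


def majority_vote(answers: Sequence[Hashable],
                  min_agreement: float = 0.5) -> Optional[Hashable]:
    """Return the most common answer if it wins a strict majority, else None.

    Assumption: agreement among samples is used as a proxy for
    correctness when no ground-truth checker exists.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_agreement else None


if __name__ == "__main__":
    generated = [(12, 30, 42), (99, 1, 101), (7, 8, 15)]
    print(keep_verified(generated, addition_is_correct))
    print(majority_vote(["42", "42", "41"]))
```

Returning `None` on a split vote drops the problem from the training set instead of guessing, which is the conservative choice when the point of the filter is to keep unverified answers out of the next round.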

Where this gets subtle is that bad reward signals can manufacture their own avalanche even with verification in place. Training on near-impossible problems makes models learn degenerate shortcuts such as answer repetition and skipped computation, and those shortcuts then contaminate capabilities the model already had, because group-relative normalization treats a rare lucky success as a high-advantage trajectory worth reinforcing (Do overly hard RLVR samples actually harm model capabilities?). Self-rewarding setups hit a related trap: as the model's chosen and rejected answers converge in quality, the preference gradient collapses and the loop quietly stops learning (Why does self-rewarding training collapse when responses improve?). And even teaching a model to self-correct can't be done on offline traces: the errors it practiced fixing won't match the errors it actually makes, so correction has to be trained online against the model's own live error distribution (Why does self-correction training on offline data fail?). The through-line across all of these: in any self-training loop, the verifier is the load-bearing component, and the avalanche is simply what failure of the verifier looks like in slow (actually, alarmingly fast) motion.
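The effect of group-relative normalization on a rare success can be checked with simple arithmetic. The sketch below assumes the common form in which each reward is centered on the group mean and divided by the group standard deviation; the use of the population standard deviation and the `eps` term are assumptions, and real implementations may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Center rewards on the group mean and scale by the group spread.

    Assumption: population standard deviation plus a small eps to
    avoid division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    lucky_hit = [1, 0, 0, 0, 0, 0, 0, 0]   # one accidental success in eight
    balanced = [1, 1, 1, 1, 0, 0, 0, 0]    # half the group succeeds
    print(round(group_relative_advantages(lucky_hit)[0], 3))
    print(round(group_relative_advantages(balanced)[0], 3))
```

Under this normalization a single success among G binary rewards receives an advantage of about the square root of G - 1, so the rarer the success, the harder it is reinforced, whether or not the trajectory behind it reasoned soundly.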


Sources (9 notes)

How quickly do errors compound during model self-training?

Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs suffer 9x score gap shrinkage as chosen and rejected responses converge, destroying the DPO gradient. Temporal decoupling—anchoring rejected responses to past models and chosen responses to future models—maintains the preference signal without extra compute.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about error avalanching in self-training loops. The question remains: *how and why do errors compound across self-training iterations, and what separates recovery from collapse?*

What a curated library found — and when (dated claims, not current truth): Spanning 2024–2026, a library identified five key constraints:
• Collapse arrives within 2–3 iterations when verification fails; the bottleneck is filter quality, not model capability (2025-02).
• Models intrinsically trust their own outputs, biasing self-evaluation toward accepting errors (2024-03).
• Errors accumulate in context history non-linearly; scaling doesn't help, only test-time compute (thinking) does (2025-09).
• Correct-only self-training generalizes exponentially; unfiltered loops fail symmetrically (2025-02).
• Self-rewarding loops hit gradient collapse as chosen/rejected answers converge; preference signals vanish (2025-08).

Anchor papers (verify; mind their dates):
• arXiv:2502.01612 (2025) — Self-Improving Transformers, extreme length generalization via iterated correct-only training.
• arXiv:2403.09972 (2024) — Think Twice Before Trusting, models' self-detection bias and failure modes.
• arXiv:2508.06026 (2025) — Temporal Self-Rewarding, decoupling chosen/rejected to address gradient collapse.
• arXiv:2605.28388 (2026) — Sample Difficulty in RLVR, how hard negatives induce degenerate shortcuts.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above—especially the 2–3 iteration collapse, the context-history error amplification, and self-rewarding gradient saturation—probe whether recent methods (e.g., multi-round verifiers, online correction, temporal decoupling) or newer model scaling/reasoning tokens have relaxed these limits. Separate the durable question (when does self-improvement fail?) from the perishable limitation (collapse always within N iterations).
(2) Surface work from the last 6 months that contradicts the "verifier is load-bearing" thesis: for example, do pure scaling, emergent reasoning, or architectural changes (such as better attention or memory) let models self-correct without external verification?
(3) Propose 2 research questions assuming verification bottlenecks may have shifted: (a) Can online, adversarial verification (model-vs-model debate) outrun the avalanche? (b) Does interleaving reasoning and correction *within* a single forward pass avoid the offline-mismatch trap?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
