INQUIRING LINE

Why does filtering for correct examples prevent error compounding in self-training?

This explores why a simple filter — keep only the outputs the model got right, retrain on those — is enough to stop the runaway error amplification that otherwise kills self-training, and what that filter is really standing in for.


The short answer the corpus points to: the filter isn't really about correctness. It's about supplying *external verification* that the model can't generate from inside itself.

Start with the failure it prevents. When a model trains on its own raw output, small inaccuracies don't stay small. They avalanche, compounding exponentially within just two or three iterations and stalling improvement at an error floor set by the quality of the verification, not by how capable the model actually is (see "How quickly do errors compound during model self-training?"). The errors spiral rather than self-correct because of a structural bias: models systematically over-trust the answers they themselves generated, because high-probability outputs *feel* correct during evaluation (see "Why do models trust their own generated answers?"). So a model grading its own work tends to ratify its own mistakes, and each round of retraining bakes the mistake deeper.

Filtering for correct examples breaks the loop precisely because it inserts a judgment the model didn't make about itself. The cleanest demonstration is transformers learning addition: standard models go from 10-digit to 100-digit problems by generating solutions, *filtering for correctness*, and retraining, and the gain is exponential across rounds with no saturation (see "Can transformers improve exponentially by learning from their own correct solutions?"). The filter is the load-bearing part. Without it you're amplifying noise; with it you're amplifying signal. The deeper principle is that self-improvement is formally bounded by a generation-verification gap: every reliable fix needs something external to validate it, and no amount of metacognition lets a model escape that on its own (see "What stops large language models from improving themselves?").
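The generate-then-filter step described above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the method from the addition work: `noisy_adder` is a hypothetical stand-in for a model that is wrong about 20% of the time, `is_correct` plays the external verifier by recomputing the sum independently, and the retraining step is left out entirely.

```python
import random
from typing import Callable, List, Tuple

Problem = Tuple[int, int]
Example = Tuple[Problem, int]  # (problem, proposed answer)

def noisy_adder(problem: Problem, rng: random.Random, p_wrong: float = 0.2) -> int:
    """Hypothetical stand-in for a model: usually right, sometimes off by one."""
    a, b = problem
    answer = a + b
    if rng.random() < p_wrong:
        answer += rng.choice([-1, 1])
    return answer

def is_correct(problem: Problem, answer: int) -> bool:
    """External verifier: recomputes the sum without consulting the model."""
    a, b = problem
    return a + b == answer

def filter_correct(candidates: List[Example],
                   verify: Callable[[Problem, int], bool]) -> List[Example]:
    """Keep only the generated examples that pass the external check."""
    return [(p, ans) for p, ans in candidates if verify(p, ans)]

def label_error_rate(examples: List[Example]) -> float:
    """Fraction of examples whose answer is wrong."""
    return sum(not is_correct(p, ans) for p, ans in examples) / len(examples)

rng = random.Random(0)
problems = [(rng.randrange(10**10), rng.randrange(10**10)) for _ in range(1000)]
raw = [(p, noisy_adder(p, rng)) for p in problems]   # what unfiltered self-training would reuse
kept = filter_correct(raw, is_correct)               # what filtered self-training would reuse

raw_error = label_error_rate(raw)
kept_error = label_error_rate(kept)
print(f"raw pool: {len(raw)} examples, error rate {raw_error:.3f}")
print(f"filtered pool: {len(kept)} examples, error rate {kept_error:.3f}")
```

The filtered pool is smaller but contains no wrong labels, which is the property that matters when the pool becomes the next round's training data. In this toy the verifier is perfect; a real verifier would leak some errors through, and that leak rate is what would set the floor.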

What counts as "external" is more flexible than it sounds, and this is where the corpus gets interesting. A correctness filter is one form, but the same gating logic shows up wearing other clothes. Asymmetric self-play uses *majority-vote* among multiple solver attempts as its verifier, letting a model bootstrap with no human labels at all (see "Can language models improve themselves without any external training data?"). Bidirectional RAG only writes a generated answer back into its own corpus if it passes entailment, attribution, and novelty checks, a gate that keeps hallucinations from polluting future retrievals (see "Can RAG systems safely learn from their own generated answers?"). All three are the same move: an admissions test that the model's own confidence cannot bribe.
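As an illustration of the majority-vote form of the gate, here is a minimal sketch. The function name and the `min_agreement` threshold are assumptions made for this example, not details taken from the self-play work: the gate accepts an answer only when more than a set share of independent attempts agree on it, and abstains otherwise.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def majority_vote(answers: Sequence[Hashable],
                  min_agreement: float = 0.5) -> Optional[Hashable]:
    """Return the most common answer if its share exceeds min_agreement, else None.

    min_agreement is a hypothetical knob for this sketch; None means "abstain".
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_agreement else None

# Five solver attempts at the same problem: three agree, so the gate admits 42.
print(majority_vote([42, 42, 41, 42, 40]))
# No answer has more than half the votes, so the gate abstains.
print(majority_vote([1, 2, 3, 4]))
```

Agreement among attempts is only a proxy for correctness. If the attempts share the same mistake, the vote will confidently admit a wrong answer, which is one reason the choice of verifier matters as much as the presence of one.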

Two cautions keep this from being a free lunch. First, the filter is only as honest as your verifier, and verifiers can be gamed. Train on problems that are too hard and the model learns degenerate shortcuts; group-relative normalization then treats rare accidental "correct" answers as high-value, reinforcing answer-repetition and computation-skipping instead of reasoning (see "Do overly hard RLVR samples actually harm model capabilities?"). A correctness filter that can be satisfied by luck or shortcuts re-opens the avalanche through the back door. Second, filtering picks *which* examples to learn from, but not *how* a model practices recovering. Fixing self-correction specifically needs online RL on the model's own live error distribution, because offline correction traces don't match the errors that actually show up at test time (see "Why does self-correction training on offline data fail?"). So filtering is necessary, but the surprising takeaway is that it's a proxy: the thing actually preventing compounding is a verification signal the model can't fake to itself.
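The first caution is easier to see with numbers. The sketch below assumes one common formulation of group-relative normalization, in which each sampled answer's reward is standardized against the mean and standard deviation of its own sampling group; the function name, the group size of 8, and the `eps` term are choices made for this illustration and may differ from any particular RLVR setup.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against its own group: (r - mean) / (std + eps).

    Assumed formulation for illustration; eps only guards against a zero std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Near-impossible problem: 1 accidental success out of 8 samples.
rare = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
# Moderate problem: 4 successes out of 8 samples.
common = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

print(f"advantage of the lone success: {rare[0]:.3f}")    # about sqrt(7), i.e. 2.646
print(f"advantage of a typical success: {common[0]:.3f}")  # about 1.000
```

With binary rewards and a success rate p, a success gets an advantage of roughly sqrt((1 - p) / p), so the rarer the success, the harder it is reinforced. If that lone success came from a lucky guess or a shortcut, the shortcut is what receives the largest update.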


Sources (8 notes)

How quickly do errors compound during model self-training?

Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Why does filtering for correct examples prevent error compounding in self-training?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and propose that filtering works by inserting external verification the model cannot fake to itself:
• Models systematically over-trust their own outputs during self-evaluation, causing small errors to avalanche exponentially within 2–3 iterations without filtering (~2024-03).
• Correctness filtering enables exponential length generalization in transformers (10→100 digits) with no saturation; without it, noise amplifies rather than signal (~2025-02).
• The same gating principle appears in asymmetric self-play (majority-vote verification) and bidirectional RAG (entailment/attribution gates); all require a verification signal the model cannot bribe (~2024–2025).
• Overly hard training problems induce degenerate shortcuts (answer-repetition, computation-skipping) that re-open the avalanche; the filter is only as honest as its verifier (~2026-05).
• Offline SFT on model-generated correction traces fails due to distribution mismatch with live test errors; filtering picks *which* examples, but online RL on live error distribution is needed for recovery (~2024-09).

Anchor papers (verify; mind their dates):
• arXiv:2403.09972 (2024-03) — Self-detection bias in LLM self-evaluation.
• arXiv:2502.01612 (2025-02) — Self-improving transformers and length generalization.
• arXiv:2605.28388 (2026-05) — Sample difficulty and degenerate behaviors in RLVR.
• arXiv:2409.12917 (2024-09) — Self-correction via RL and distribution mismatch.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 4), methods (process reward models, outcome supervision at scale), tooling (verification SDKs, automated checkers), or orchestration (multi-agent ensembles, cached reasoning) have since relaxed or overturned it. Separate the durable question—does external verification remain necessary to prevent error compounding?—from the perishable limitation (e.g., does modern RL on live errors now avoid distribution mismatch?). Cite what resolved it; flag where constraints still hold.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Has anyone shown a model that self-corrects without external filtering, or proven filtering is redundant under certain training regimes?
(3) Propose 2 research questions that assume the regime may have moved: (a) If verification is now cheap and scalable, what *new* bottleneck (e.g., reward hacking, coverage of edge cases) replaces error compounding? (b) Can a model learn to *generate* its own trustworthy verification signal—i.e., build internal metrics that match external ground truth—and if so, does that dissolve the external-verification requirement?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
