INQUIRING LINE

Why does online RL succeed where supervised training fails for self-correction?

This explores why teaching a model to fix its own mistakes works when it practices live (online RL) but breaks down when you simply show it pre-recorded examples of corrections (supervised fine-tuning).


The corpus has a sharp, direct answer to the core of this question, and a set of adjacent findings that explain the deeper mechanism.

The heart of it is a distribution mismatch. When you train self-correction on offline correction traces, the errors in that training data aren't the errors the model actually makes at test time. The model learns to 'correct' mistakes it would never have produced, and it tends to collapse into a single rote correction move regardless of what went wrong (see "Why does self-correction training on offline data fail?"). Online RL closes this gap by letting the model trip over its *own* live mistakes and practice recovering from them. The supervision is generated under the same conditions the model faces at inference, so training errors and test errors come from the same distribution.
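
One way to make the mismatch concrete is to compare the error types found in offline correction traces with the error types in the model's own rollouts. The sketch below is illustrative only: the error labels and counts are invented, and total variation distance is one assumed choice of metric, not something the source notes prescribe.

```python
from collections import Counter

def error_distribution(error_labels):
    """Turn a list of error-type labels into a probability distribution."""
    counts = Counter(error_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means no overlap."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

# Hypothetical labels, invented purely for illustration.
offline_trace_errors = ["arithmetic"] * 6 + ["sign_flip"] * 3 + ["unit_mixup"] * 1
model_rollout_errors = ["misread_question"] * 5 + ["arithmetic"] * 2 + ["dropped_case"] * 3

offline = error_distribution(offline_trace_errors)
online = error_distribution(model_rollout_errors)
mismatch = total_variation(offline, online)
print(f"TV distance between offline and live error distributions: {mismatch:.2f}")
```

With these made-up counts the two distributions share only the "arithmetic" category, so most of the offline corrections address errors the model rarely makes. Data collected from the model's own rollouts has a distance of zero to its live error distribution by construction.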

What makes this more than a data-hygiene story is what RL is actually doing under the hood. Several notes converge on the idea that RL doesn't install new reasoning; it surfaces and reweights capabilities already latent in the pretrained model. RL updates touch only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across random seeds, which suggests targeted structural adjustments rather than a rewrite of the model (see "Does reinforcement learning update only a small fraction of parameters?"). And verifiable rewards act as catalysts that activate existing pretraining strategies rather than teaching genuinely new ones (see "How does RL training reshape reasoning and what gets lost?"). Self-correction is exactly the kind of skill the base model can already *do* but doesn't reliably *deploy*. The right training signal is therefore one that selectively reinforces deployment on real failures, which is what online practice provides and a static imitation target cannot.

There's a second reason supervised imitation underperforms here: it can only copy the modes present in the data, whereas online training is a feedback loop that can discover the recovery move itself. The corpus is full of variations on 'manufacture the missing supervision from the model's own behavior': agents treating the consequences of their own actions as the training signal with no external reward (see "Can agents learn from their own actions without external rewards?"), tree search ranking a model's own solution paths to replace human annotation (see "Can tree search replace human feedback in LLM training?"), self-play loops co-evolving skills against an internal judge (see "Can language models learn skills without human supervision?"), and models learning to score their own outputs in unused sequence space (see "Can models learn to evaluate their own work during training?"). The common thread: feedback grounded in the model's actual rollouts beats imitation of someone else's trace, because the model is the only source that knows what *it* gets wrong.

But the corpus also warns against reading 'online RL wins' as 'online RL is magic.' The reward shape matters enormously. Binary correctness rewards quietly teach confident guessing because they never punish a confident wrong answer; adding the Brier score as a second reward term restores the incentive to be calibrated (see "Does binary reward training hurt model calibration?"). Feeding RL problems that are too hard breeds degenerate shortcuts that then contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). And RL tends to collapse onto a single dominant format within the first epoch, suppressing alternatives (see "Does RL training collapse format diversity in pretrained models?"). So the honest version of the answer is: online RL succeeds at self-correction not because reinforcement is inherently smarter than supervision, but because it trains on the model's *own live error distribution* and reweights latent skills the model already has, and it does so only when the reward is shaped to reward the right thing.


Sources (10 notes)

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
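
To show what a figure like "5–30% of parameters" measures, here is a minimal sketch that counts how many scalar weights differ between a base checkpoint and a fine-tuned one. It follows the summary's reading that the figure is the fraction of parameters touched. The dictionary-of-lists checkpoint format, the function name `updated_fraction`, and the toy weights are all assumptions for illustration, not the measurement code from the cited work.

```python
def updated_fraction(before, after, tol=0.0):
    """Fraction of scalar parameters whose value changed by more than `tol`."""
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(new - old) > tol:
                changed += 1
    return changed / total

# Hypothetical toy "checkpoints": flat lists of weights keyed by layer name.
base = {"layer0": [0.10, -0.20, 0.30, 0.40, 0.50],
        "layer1": [1.0, 2.0, 3.0, 4.0, 5.0]}
tuned = {"layer0": [0.10, -0.25, 0.30, 0.40, 0.50],
         "layer1": [1.0, 2.0, 3.0, 4.1, 5.0]}

frac = updated_fraction(base, tuned)
print(f"{frac:.0%} of parameters changed")
```

In this toy case two of ten weights move, so the update touches 20% of parameters. The rank and seed-stability claims need further analysis that this sketch does not attempt.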

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
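
The incentive argument can be checked numerically. The sketch below assumes the combined reward is correctness minus the squared gap between stated confidence and correctness (a Brier penalty); that exact form, and the function names, are assumptions rather than the paper's implementation. Here `p_correct` is the true probability that the answer is right and `confidence` is what the model reports.

```python
def expected_binary_reward(p_correct, confidence):
    """Expected reward when only correctness is scored; confidence is ignored."""
    return p_correct

def expected_brier_reward(p_correct, confidence):
    """Expected value of correctness minus (confidence - correctness)^2."""
    reward_if_right = 1.0 - (confidence - 1.0) ** 2
    reward_if_wrong = 0.0 - (confidence - 0.0) ** 2
    return p_correct * reward_if_right + (1.0 - p_correct) * reward_if_wrong

p = 0.6  # assumed true chance the answer is correct
grid = [i / 100 for i in range(101)]

# Binary reward: every stated confidence earns the same expected reward.
binary_values = {expected_binary_reward(p, q) for q in grid}

# Brier-augmented reward: find the stated confidence that maximizes it.
best_confidence = max(grid, key=lambda q: expected_brier_reward(p, q))
print(len(binary_values), best_confidence)
```

Under the binary reward, claiming 99% confidence costs nothing, so nothing discourages overconfidence. With the Brier term the expected reward peaks when the stated confidence equals the true probability of being right, which follows from setting the derivative 2p - 2q to zero.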

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
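
The effect of group-relative normalization on a rare success is easy to see with arithmetic. The sketch below assumes a GRPO-style advantage, (reward - group mean) / group standard deviation, using the population standard deviation and a small epsilon. Those details and the example reward groups are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 8 rollouts with binary correctness rewards.
hard_group = [1, 0, 0, 0, 0, 0, 0, 0]      # one accidental success
balanced_group = [1, 1, 1, 1, 0, 0, 0, 0]  # problem at the right difficulty

lucky = group_relative_advantages(hard_group)[0]
typical = group_relative_advantages(balanced_group)[0]
print(f"advantage of the lone success: {lucky:.2f}")
print(f"advantage of a success in the balanced group: {typical:.2f}")
```

A single success among eight rollouts receives an advantage of about 2.65 (the square root of 7), compared with about 1.0 for a success in the balanced group. Whatever that lucky trajectory did, including a shortcut, is reinforced more than twice as strongly.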

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing a curated library's claims about online RL vs. supervised training for self-correction. The question remains live: Why does online RL succeed where supervised training fails for self-correction?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024-04 through 2026-05. The core claims:
- Distribution mismatch: offline correction training learns to fix errors the model never makes at test time; online RL trains on the model's *own live error distribution* (2024-09, arXiv:2409.12917).
- RL updates only 5–30% of parameters in sparse, nearly full-rank subnetworks, reweighting latent capabilities rather than teaching new skills (2025-05, arXiv:2505.11711).
- Binary correctness rewards degrade calibration; proper scoring rules fix it (2024-09, arXiv:2409.15360).
- RL collapses onto a single dominant format within the first epoch, suppressing solution diversity (2025-04, arXiv:2504.07912).
- Overly hard RLVR samples induce degenerate shortcuts that contaminate pre-existing skills (2025-10, arXiv:2510.18176).

Anchor papers (verify; mind their dates):
- arXiv:2409.12917 (2024-09): Training Language Models to Self-Correct via Reinforcement Learning
- arXiv:2505.11711 (2025-05): Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
- arXiv:2504.07912 (2025-04): Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- arXiv:2510.18176 (2025-10): Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, o1, Claude 3.5 Sonnet, 2025–2026 variants), evaluation methods, reward engineering, or orchestration (multi-step rollouts, beam search, verifier reranking) have since relaxed or overturned it. Separate the durable question—why live feedback beats static imitation—from perishable limitations (calibration collapse, format convergence). Cite what resolved each; flag what still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper shown supervised fine-tuning on corrected traces *can* match online RL under certain conditions? Does any work suggest the distribution mismatch explanation is incomplete or oversimplified?
(3) Propose 2 research questions that ASSUME the regime has moved: e.g., if RL now works across a wider variety of reward shapes, what is the *new* bottleneck? If collapse onto dominant formats has been mitigated, does diversity now hurt or help downstream performance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
