INQUIRING LINE

What scaling properties emerge from RL training dynamics beyond verification?

This explores what predictable patterns show up as you scale RL training of language models — not whether the rewards are verifiable, but what the training dynamics themselves do as they grow.


The surprising thread running through the corpus is that RL training is far more *lawful* than its reputation suggests. At the largest scale studied (400K GPU-hours, 200+ models), performance climbs along a sigmoid curve in which the recipe you pick sets the ceiling and implementation details only change how fast you get there, meaning that with a stable recipe you can extrapolate a big run's outcome from a small one [Does RL training follow predictable scaling curves?]. That predictability extends inward: training tends to move through two phases, first locking in procedural correctness, then shifting the bottleneck to strategic planning, with entropy rising on planning tokens while execution entropy stabilizes [Does RL training follow a predictable two-phase learning sequence?].

But the most counterintuitive scaling property is how *little* of the model actually moves. Across seven algorithms and ten model families, RL updates only 5–30% of parameters, yet those sparse updates are nearly full-rank and almost identical across random seeds. The network isn't picking parameters arbitrarily; it's converging on a structurally determined subnetwork [Does reinforcement learning update only a small fraction of parameters?]. The mechanism underneath looks like suppression more than amplification: RL mainly learns to *stop* producing wrong trajectories rather than to invent new right ones [What actually changes inside a model during RL training?]. This dovetails with the now well-supported finding that RL surfaces capabilities already latent in pretraining rather than building new ones. It teaches a model *when* to reason, not *how*: hybrid models recover 91% of the gains just by routing tokens toward reasoning strategies that exist before any RL begins [Does RL post-training create reasoning or just deploy it?; How does RL training reshape reasoning and what gets lost?; Does RLVR actually expand what models can reason about?].

The darker scaling properties are the collapses. As RL runs, it tends to converge on a single dominant pretraining format and suppress the alternatives, typically within the first epoch. Which format wins depends on model scale rather than on which format performs best, and the homogenization is largely hidden when training starts from proprietary pretrained models [Does RL training collapse format diversity in pretrained models?]. The narrowing shows up along other axes too: binary correctness rewards systematically push models toward confident wrong answers because nothing penalizes overconfidence, quietly degrading calibration as you scale up reward optimization [Does binary reward training hurt model calibration?]. And the curriculum matters mechanically: structured tasks drive entropy down while creative tasks drive it up, so the *order* you train domains in determines whether open-ended capability survives [Does training order reshape how models handle different task types?].

What you didn't know you wanted to know: scale interacts badly with difficulty. Training on problems that are too hard doesn't just waste signal. Group-relative normalization treats a rare accidental success as a high-advantage trajectory, so the model amplifies shortcuts like answer repetition and computation-skipping, and those degenerate habits then contaminate capabilities it already had [Do overly hard RLVR samples actually harm model capabilities?]. The corpus also hints at the escape route: the same self-supervised statistic, variance across rollouts, can do double duty as both a dense reward signal and a query filter, throwing out the degenerate comparisons before they corrupt training and yielding 2–3× faster, more stable runs on tasks with no verifier at all [Can one statistical measure serve dual purposes in RL training?]. Taken together, the picture is that RL at scale is a sharpening tool, not a growth tool: it concentrates, suppresses, and homogenizes along sigmoid curves you can predict. That is exactly why what you feed it (difficulty, format diversity, calibration terms, task order) decides whether scaling helps or quietly hollows the model out.


Sources (12 notes)

Does RL training follow predictable scaling curves?

Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.
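
A minimal sketch of what extrapolating from small runs could look like, assuming a saturating sigmoid in compute. The functional form, the parameter names (floor, ceiling, midpoint, slope), the grid ranges, and every number are illustrative assumptions, not the study's fitted values; the data are synthetic.

```python
# Illustrative only: the curve form and every number below are assumptions,
# not values from the study. Standard library only.

def sigmoid_perf(compute, floor, ceiling, midpoint, slope):
    """Saturating curve: rises from `floor` toward `ceiling` as compute grows."""
    return floor + (ceiling - floor) / (1.0 + (midpoint / compute) ** slope)


def fit_sigmoid(points, floor):
    """Brute-force least-squares fit of (ceiling, midpoint, slope) on a grid."""
    best = None
    for ci in range(41):                    # ceiling candidates: 0.40 .. 0.80
        ceiling = 0.40 + 0.01 * ci
        for si in range(16):                # slope candidates: 0.5 .. 2.0
            slope = 0.5 + 0.1 * si
            for mi in range(31):            # midpoint candidates: 1e1 .. 1e4 GPU-hours
                midpoint = 10 ** (1 + 0.1 * mi)
                loss = sum(
                    (sigmoid_perf(c, floor, ceiling, midpoint, slope) - y) ** 2
                    for c, y in points
                )
                if best is None or loss < best[0]:
                    best = (loss, ceiling, midpoint, slope)
    return best[1], best[2], best[3]


# Synthetic "small runs" generated from a hidden recipe (hypothetical values).
TRUE = dict(floor=0.20, ceiling=0.62, midpoint=10 ** 2.5, slope=1.0)
small_budgets = [10, 30, 100, 300, 1000, 3000]           # GPU-hours
small_runs = [(c, sigmoid_perf(c, **TRUE)) for c in small_budgets]

ceiling, midpoint, slope = fit_sigmoid(small_runs, floor=TRUE["floor"])

big_budget = 100_000                                      # GPU-hours
predicted = sigmoid_perf(big_budget, TRUE["floor"], ceiling, midpoint, slope)
actual = sigmoid_perf(big_budget, **TRUE)
print(f"fitted ceiling={ceiling:.2f}, predicted={predicted:.3f}, actual={actual:.3f}")
```

The synthetic data here are noiseless, so the fit recovers the hidden ceiling. Real runs are noisy, and an extrapolation like this is only as trustworthy as the range of small budgets that covers the bend of the curve.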

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
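
One way to make the sparsity and seed-consistency claims concrete is to compare checkpoints directly. The sketch below is hypothetical: flat Python lists stand in for real weight tensors, the tolerance `tol` and the Jaccard overlap measure are assumed choices, and the numbers are toy values. It does not measure the rank of the update.

```python
# Illustrative only: toy flat parameter lists stand in for real checkpoints.

def updated_indices(base, tuned, tol=0.0):
    """Indices of parameters whose value moved by more than `tol`."""
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}


def update_fraction(base, tuned, tol=0.0):
    """Fraction of parameters that changed during fine-tuning."""
    return len(updated_indices(base, tuned, tol)) / len(base)


def jaccard_overlap(a, b):
    """Overlap of two index sets: |A & B| / |A | B| (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


base = [0.0] * 20
seed_a = list(base)
seed_b = list(base)
for i in (0, 1, 2, 3):          # parameters touched by the run with seed A
    seed_a[i] += 0.01
for i in (0, 1, 2, 4):          # parameters touched by the run with seed B
    seed_b[i] -= 0.02

frac_a = update_fraction(base, seed_a)
overlap = jaccard_overlap(updated_indices(base, seed_a),
                          updated_indices(base, seed_b))
print(f"updated fraction (seed A): {frac_a:.2f}, seed overlap: {overlap:.2f}")
```

An update fraction in the 5–30% range together with an overlap close to 1.0 across seeds would be the pattern this note describes.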

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
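
The crossover this note describes can be reproduced with toy numbers. The sketch below assumes the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct; the note does not say which estimator was used, and the per-problem counts are hypothetical.

```python
from math import comb

# Illustrative only: the per-problem correct counts are made-up toy values.

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100                          # samples per problem
base_correct = [5, 3, 2, 1]      # base model: rarely right, but on every problem
rlvr_correct = [90, 80, 0, 0]    # tuned model: sharp on two problems, never on two

for k in (1, 64):
    base_score = mean_pass_at_k(base_correct, N, k)
    rlvr_score = mean_pass_at_k(rlvr_correct, N, k)
    print(f"k={k}: base={base_score:.3f}, rlvr={rlvr_score:.3f}")
```

In this toy setup the tuned model wins at k=1 while the base model wins at k=64, because extra samples cannot rescue problems whose correct answers have dropped out of the sampling distribution.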

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
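
A small sketch of why a Brier term changes the incentive. It assumes the combined reward is correctness minus the squared gap between stated confidence and correctness; that exact form, and the function names, are assumptions made for illustration and may differ from the method this note summarizes.

```python
# Illustrative only: the reward form and all names are assumptions.

def calibrated_reward(correct, confidence):
    """Correctness minus a Brier penalty on the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_binary(p_correct, confidence):
    """Expected binary reward: the stated confidence never enters."""
    return p_correct


def expected_calibrated(p_correct, confidence):
    """Expected calibrated reward when the answer is right with prob. p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(expected_fn, p_correct, steps=100):
    """Confidence on a grid over [0, 1] that maximizes the expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_fn(p_correct, q))


p = 0.3  # hypothetical: the model is right 30% of the time on this kind of question
print("binary reward at confidence 0.0 vs 1.0:",
      expected_binary(p, 0.0), expected_binary(p, 1.0))
print("best confidence under calibrated reward:",
      best_confidence(expected_calibrated, p))
print("confident wrong answer scores:", calibrated_reward(False, 1.0))
```

Under the binary reward every confidence level earns the same expected reward, so nothing discourages overclaiming. Under the assumed calibrated reward, the expected value peaks when the stated confidence matches the true success rate.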

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
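
The amplification effect is easy to see numerically. This is a sketch of group-relative normalization, assuming binary rewards, a group of 16 rollouts, and the population standard deviation; group size, the `eps` constant, and the choice of standard deviation vary across implementations and are assumptions here.

```python
import math
import statistics

# Illustrative only: group size, eps, and the use of population std are assumptions.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and population std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


GROUP = 16
too_hard = [1.0] + [0.0] * (GROUP - 1)     # one accidental success in 16 rollouts
balanced = [1.0] * 8 + [0.0] * 8           # half of the rollouts succeed
impossible = [0.0] * GROUP                 # no rollout succeeds

lucky_adv = group_relative_advantages(too_hard)[0]
typical_adv = group_relative_advantages(balanced)[0]
dead_advs = group_relative_advantages(impossible)

print(f"advantage of the lone success: {lucky_adv:.3f}")
print(f"advantage of a success in a balanced group: {typical_adv:.3f}")
print(f"all-fail group carries signal: {any(a != 0.0 for a in dead_advs)}")
```

With a single success among G rollouts, the normalized advantage is about sqrt(G - 1), so the lone lucky trajectory is weighted several times more heavily than a success on a problem of moderate difficulty, whatever shortcut produced it.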

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating RL training dynamics in language models. The core question remains open: what *scaling laws* and phase transitions govern RL post-training, independent of reward verifiability?

What a curated library found — and when (findings span 2024–2026; treat as dated claims):
• RL training follows predictable sigmoid trajectories where recipe sets the ceiling; small runs extrapolate to large ones (~400K GPU-hours across 200+ models, 2025).
• RL updates only 5–30% of parameters, but those subnetworks are full-rank and seed-invariant — suggesting structural convergence, not random selection (2025).
• RL surfaces latent capabilities; it teaches *when* to reason, not *how*. Most gains recover pre-existing strategies (2025).
• RL converges on a single dominant pretraining format, suppressing alternatives in model-scale-dependent ways — a homogenization invisible in proprietary models (2025).
• Binary rewards degrade calibration at scale; curriculum order (structured vs. creative domains) mechanically determines whether open-ended capability survives (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.11711 (May 2025) — parameter subnetwork sparsity & full-rank structure
• arXiv:2504.13837 (Apr 2025) — reasoning capacity boundaries; RL as capability surfacing
• arXiv:2504.07912 (Apr 2025) — echo chamber / pretraining distribution amplification
• arXiv:2605.28388 (May 2026) — sample difficulty phase transitions in RLVR

Your task:
(1) RE-TEST EACH CONSTRAINT. For sigmoid predictability, sparse full-rank updates, and latent-capability recovery: do newer model scales, training algorithms (e.g., online RL, synthetic data injection), or evaluation harnesses (better calibration metrics, multi-domain benchmarks) since mid-2026 relax these? Separate the durable question (are scaling laws universal across architectures?) from perishable limits (do those limits still bind at current model sizes?).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — particularly any showing RL *does* expand capability beyond pretraining, or that sigmoid ceilings are breached by curriculum or architecture changes.
(3) Propose 2 research questions that ASSUME the regime has moved: (a) if homogenization is now controllable via diversity-preserving rewards, what is the cost in convergence speed? (b) if sample difficulty no longer triggers shortcut amplification (via better filtering or online methods), do we recover the capability gains while staying calibrated?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
