INQUIRING LINE

An RL agent learns to walk at depth 16 and wall-climb at depth 256 — depth doesn't just help, it unlocks entirely new behaviors.

How do residual connections and layer norm stabilize training in deep RL?

This explores how deep reinforcement-learning training stays stable as networks get bigger — but the corpus doesn't tackle residual connections or layer norm as architectural tricks; instead it locates stability in the *learning dynamics* of RL itself.

This reads the question as "what keeps deep RL training from falling apart?", and the first thing worth knowing is that nothing in this collection treats residual connections or layer normalization as the stabilizing machinery. Those are the textbook answers: skip connections keep gradients flowing through depth, and normalization keeps activations at a stable scale. What the corpus offers instead is a different story: in deep RL, the stability problems it documents show up in the *training dynamics*, not the architecture, and they are addressed with very different levers.
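For reference, the textbook mechanism named above can be sketched in a few lines. This is a minimal illustration and not something drawn from the corpus: the pre-norm arrangement, the toy `sublayer`, and every name below are assumptions chosen for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x, weight=0.5):
    """Hypothetical stand-in for a learned layer: scale, then tanh."""
    return [math.tanh(weight * v) for v in x]

def residual_block(x):
    """Pre-norm residual block: output = x + f(LayerNorm(x))."""
    update = sublayer(layer_norm(x))
    # The skip path carries x through unchanged; the block only adds a bounded update.
    return [xi + ui for xi, ui in zip(x, update)]

def run_stack(x, depth):
    """Apply the same residual block `depth` times."""
    for _ in range(depth):
        x = residual_block(x)
    return x

x0 = [100.0, -50.0, 3.0, 0.5]
normed = layer_norm(x0)          # badly scaled input, normalized before the sublayer
out = run_stack(x0, depth=64)    # 64 stacked blocks
```

Because the sublayer here is bounded by tanh, each block can move a coordinate by less than 1, so 64 blocks leave every coordinate within 64 of its input. That is the sense in which the identity path keeps a deep stack well behaved in this toy setting.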

Start with depth itself, since that's the literal subject. Scaling self-supervised RL to 1000-layer networks doesn't just make things smoother; it unlocks qualitatively new behaviors at specific depth thresholds (walking at depth 16, wall-climbing at depth 256), with gains coming from *both* better exploration and more expressive representations (see "Does network depth unlock qualitatively new behaviors in RL?"). So depth in RL isn't merely a thing you have to stabilize against; it's a source of capability, which reframes the whole question.

Where the corpus locates fragility is elsewhere. RL training doesn't smoothly nudge the whole network: it updates only 5–30% of parameters, in subnetworks that are nearly identical across random seeds, and it works mostly by *suppressing* wrong trajectories rather than amplifying right ones (see "What actually changes inside a model during RL training?" and "Does reinforcement learning update only a small fraction of parameters?"). It also moves in two phases: first nailing execution correctness, then shifting the bottleneck to strategic planning (see "Does RL training follow a predictable two-phase learning sequence?"). Instability, in this picture, is what happens when those dynamics go wrong: entropy collapses and kills open-ended ability (see "Does training order reshape how models handle different task types?"), or hard samples teach degenerate shortcuts that contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?").
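One way to make the "5–30% of parameters" figure concrete is to count how many weights differ between a base checkpoint and its RL-tuned version. The sketch below is only an illustration: the flat weight lists, the `tol` threshold, and the function name are assumptions, and the notes do not say how the underlying paper measured sparsity.

```python
def update_fraction(base, tuned, tol=0.0):
    """Fraction of parameters whose value moved by more than `tol` (assumed metric)."""
    if len(base) != len(tuned):
        raise ValueError("parameter lists must have the same length")
    if not base:
        raise ValueError("parameter lists must be non-empty")
    changed = sum(1 for b, t in zip(base, tuned) if abs(t - b) > tol)
    return changed / len(base)

# Toy "checkpoint": ten weights, of which RL training touched two.
base = [0.10, -0.20, 0.30, 0.05, -0.40, 0.25, 0.00, 0.15, -0.10, 0.35]
tuned = list(base)
tuned[2] += 0.01
tuned[7] -= 0.02

frac = update_fraction(base, tuned)  # 2 of 10 weights changed
```

Here `frac` comes out to 0.2, which would sit inside the reported 5–30% band. On a real model the same count would run over every tensor in the two checkpoints.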

The stabilizers the corpus actually documents are training-side, not architectural. Staying close to the base model's distribution (low KL drift) preserves the model's *plasticity*: its ability to keep learning new tasks instead of stalling when the domain shifts (see "Does staying close to the base model preserve learning ability?"). Reusing cross-rollout variance as both a reward weight and a query filter buys 2–3× faster, more stable training by throwing out degenerate comparisons (see "Can one statistical measure serve dual purposes in RL training?"). And adding a Brier-score term to binary rewards stops the model from collapsing into confident guessing (see "Does binary reward training hurt model calibration?").
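The Brier-score lever is the easiest of these to show in code. The sketch below assumes one particular combination, correctness minus squared calibration error with equal weight; the notes only say a Brier term is added, so the exact form and the function name are assumptions.

```python
def brier_augmented_reward(correct, confidence):
    """Binary correctness reward minus a Brier penalty (assumed equal weighting)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    brier = (confidence - y) ** 2  # squared gap between stated confidence and outcome
    return y - brier

confident_right = brier_augmented_reward(True, 1.0)
hedged_right = brier_augmented_reward(True, 0.5)
hedged_wrong = brier_augmented_reward(False, 0.5)
confident_wrong = brier_augmented_reward(False, 1.0)
```

A plain binary reward scores both wrong answers identically at 0. With the penalty, the confidently wrong answer scores lower than the hedged wrong one, so confident guessing stops being free.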

So the thing you didn't know you wanted to know: in this collection, "stabilizing deep RL" is barely about the network plumbing at all. The leverage is in *what you reward, how close you stay to the base model, and which samples you train on*, and depth, far from being the enemy that stability fights, is where new behavior comes from.

Sources (9 notes)

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
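The normalization effect described in this note can be checked with a small calculation. The sketch assumes the common form of group-relative advantage (reward minus group mean, divided by group standard deviation); the use of the population standard deviation, the `eps` term, and the rollout counts are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Nearly impossible prompt: one accidental success in 16 rollouts.
hard = [1.0] + [0.0] * 15
# Moderately difficult prompt: half the rollouts succeed.
medium = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(hard)
adv_medium = group_relative_advantages(medium)
```

The lone success on the hard prompt gets an advantage of roughly 3.9, while each success on the medium prompt gets roughly 1.0. Under this normalization the rarer the success, the harder it is reinforced, whether or not the reasoning behind it was sound.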

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
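Drift from the base model is usually quantified as a KL divergence between the tuned and base output distributions. The sketch below is a toy version over three tokens; the direction KL(tuned || base) and all distributions are assumptions, and the note does not spell out how the 70% figure was computed.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.5, 0.3, 0.2]      # base model's next-token distribution (toy values)
near = [0.45, 0.35, 0.20]   # tuned model that stayed close
far = [0.05, 0.05, 0.90]    # tuned model that drifted

drift_near = kl_divergence(near, base)
drift_far = kl_divergence(far, base)
```

In practice the same quantity would be averaged over many prompts and token positions; the low-drift model is the one whose average stays small.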

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
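The query-level half of this idea can be sketched as a variance filter: drop any query whose rollouts all score the same, since comparing them gives no learning signal. The scores, the threshold, and the function name are assumptions for illustration, and the token-level weighting that DRO also uses is not shown.

```python
import statistics

def filter_queries(rollout_scores, min_variance=1e-3):
    """Keep queries whose rollout scores vary enough to be worth comparing."""
    kept = {}
    for query, scores in rollout_scores.items():
        # Fewer than two rollouts, or near-zero variance, means a degenerate comparison.
        if len(scores) >= 2 and statistics.pvariance(scores) >= min_variance:
            kept[query] = scores
    return kept

scores = {
    "q1": [0.2, 0.8, 0.5, 0.9],  # rollouts disagree: informative
    "q2": [0.0, 0.0, 0.0, 0.0],  # all fail: nothing to compare
    "q3": [1.0, 1.0, 1.0, 1.0],  # all succeed: nothing to compare
}

kept = filter_queries(scores)
```

Only `q1` survives the filter in this example.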

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

The Art of Scaling Reinforcement Learning Compute for LLMs · 4.05 match · arXiv
Reinforcement Learning with Rubric Anchors · 2.48 match · arXiv
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs · 2.47 match · arXiv
Reinforcement Learning for Reasoning in Large Language Models with One Training Example · 2.46 match · arXiv
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models · 2.45 match · arXiv
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining · 2.42 match · arXiv
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models · 1.73 match · arXiv
Learning, Fast and Slow: Towards LLMs That Adapt Continually · 1.67 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL systems analyst. The question: *What architectural and training-dynamics mechanisms actually stabilize deep RL in practice?* remains open — especially as model scale and RL compute grow.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as perishable:
• Depth (1000+ layers) doesn't destabilize RL; instead, qualitative behavioral capabilities emerge at critical thresholds (walking ~depth 16, wall-climbing ~depth 256), driven by better exploration + representation expressivity (~2025).
• RL training updates only 5–30% of parameters in sparse, nearly seed-identical subnetworks, mostly by *suppressing* wrong trajectories rather than amplifying correct ones (~2025).
• Training exhibits two phases: procedural correctness first, then strategic planning bottleneck (~2025).
• Stability levers are *training-side*, not architectural: low KL drift from the base model preserves plasticity; cross-rollout variance reuse yields 2–3× faster, more stable training; a proper scoring rule (Brier) added to binary rewards prevents collapse into confident guessing (~2024–2026).
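• Entropy dynamics in multi-task RL diverge by domain (structured tasks lower output entropy, creative tasks raise it), and entropy collapse damages open-ended ability; overly hard samples induce degenerate shortcuts (~2025–2026).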

Anchor papers (verify; mind their dates):
• arXiv:2503.14858 (2025-03) — 1000-layer networks and qualitative behavioral jumps
• arXiv:2505.11711 (2025-05) — sparse subnetwork updates in RL finetuning
• arXiv:2605.12484 (2026-05) — sample difficulty mechanisms in RLVR
• arXiv:2604.28388 (2025-11) — weight-sparse circuits and interpretability

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, latest Claude), post-training methods (DPO variants, synthetic data pipelines, constitutional AI), or evaluation harnesses have since relaxed or overturned it. Separate durable questions (likely still open: *how do we prevent collapse in ultra-long rollouts?*) from perishable limitations (possibly resolved: *depth destabilizes*). Cite what resolved it.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months — especially any paper claiming residual connections or layer norm DO matter in deep RL, or any showing the sparse-update story is wrong.
(3) Propose 2 research questions that assume the regime *has* moved: e.g., *Does extreme-scale RL (>10B-parameter model, >1T tokens) still exhibit two-phase training, or does it collapse into a single continuous dynamic?* or *Can we predict degenerate-shortcut induction before it happens, rather than filtering it post-hoc?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
