SYNTHESIS NOTE

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (up to 1024 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges the assumption that RL supplies too little feedback to train large networks.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

Most RL research uses shallow architectures (2-5 layers). Scaling network depth to 1024 layers in self-supervised RL improves performance by 2x to 50x, but the gains do not arrive gradually. Instead, performance jumps sharply at critical depth thresholds that vary by environment. For example, depth 4 produces rudimentary policies (falling, throwing toward target), depth 16 enables walking upright, depth 64 navigates simple mazes, and depth 256 produces entirely novel behaviors (leveraging body position to propel over walls, shifting into seated postures to worm through obstacles).

The mechanism is a synergy between exploration and expressivity. A controlled experiment separates these factors: a deep and a shallow "learner" network each train on data gathered by a separate "collector" network. When the collector is deep (rich exploration data), the deep learner substantially outperforms the shallow one, so expressivity matters. When the collector is shallow (poor exploration data), both learners perform equally poorly, so exploration constrains everything. Neither factor alone explains the gains; scaling depth enhances both simultaneously.
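The collector/learner design described above is a 2x2 grid, which can be sketched as follows. This is only an illustration of the experimental layout: the depth values and the names `DEPTHS`, `experiment_grid`, and `learner_gap` are hypothetical and do not come from the note.

```python
from itertools import product

# Illustrative depths only; the note does not state which depths the experiment used.
DEPTHS = {"shallow": 4, "deep": 64}


def experiment_grid(depths=DEPTHS):
    """Return every (collector, learner) pairing of the 2x2 design."""
    return list(product(depths, repeat=2))


def learner_gap(scores, collector):
    """Deep-learner score minus shallow-learner score for a fixed collector.

    `scores` maps (collector, learner) pairs to a measured performance value.
    """
    return scores[(collector, "deep")] - scores[(collector, "shallow")]


# List the four conditions; no results are assumed here.
for collector, learner in experiment_grid():
    print(f"collector={collector:<7} learner={learner:<7} "
          f"depths=({DEPTHS[collector]}, {DEPTHS[learner]})")
```

Given measured scores, the pattern reported in the note would appear as a large `learner_gap(scores, "deep")` alongside a `learner_gap(scores, "shallow")` close to zero.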

These experiments are conducted in an unsupervised goal-conditioned setting with no demonstrations or rewards: the agent must explore from scratch and learn to reach commanded goals. A self-supervised contrastive RL algorithm provides the learning framework. Training at this depth is stabilized with residual connections, layer normalization, and Swish activations.
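To show how the three stabilizing ingredients fit together, here is a minimal sketch of a residual block in plain Python. It is written under stated assumptions and is not the authors' implementation: the Dense, LayerNorm, Swish ordering, the placement of the skip connection after the last activation, the initializer, and all sizes and names (`init_dense`, `residual_block`, `deep_network`) are illustrative choices.

```python
import math
import random


def swish(x):
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def dense(v, weights, bias):
    """Affine map; `weights` is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_dense(width, rng):
    """Hypothetical initializer: small uniform weights, zero bias."""
    scale = 1.0 / math.sqrt(width)
    weights = [[rng.uniform(-scale, scale) for _ in range(width)] for _ in range(width)]
    return weights, [0.0] * width


def residual_block(v, params):
    """Apply Dense -> LayerNorm -> Swish units, then add the block input back."""
    h = v
    for weights, bias in params:
        h = [swish(x) for x in layer_norm(dense(h, weights, bias))]
    return [a + b for a, b in zip(v, h)]  # skip connection


def deep_network(v, blocks):
    """Stack residual blocks; total depth grows with the number of blocks."""
    for params in blocks:
        v = residual_block(v, params)
    return v


rng = random.Random(0)
WIDTH, UNITS_PER_BLOCK, NUM_BLOCKS = 8, 2, 16  # illustrative sizes, not from the note
blocks = [[init_dense(WIDTH, rng) for _ in range(UNITS_PER_BLOCK)]
          for _ in range(NUM_BLOCKS)]
output = deep_network([0.1 * i for i in range(WIDTH)], blocks)
```

The skip connection gives each block an identity path, so stacking more blocks does not force the signal through every nonlinearity, while layer normalization keeps the inputs to each Swish in a bounded range.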

The finding challenges the conventional wisdom that RL provides too few bits of feedback to train large networks. In self-supervised RL specifically, the ratio of feedback to parameters becomes less constraining because the agent generates its own training signal. Read alongside the related note "Why does parallel reasoning outperform single chain thinking?", the depth-scaling result offers a complementary axis: scaling depth may be as important as scaling parallel breadth for unlocking qualitatively new capabilities.

Original note title

network depth above critical thresholds causes qualitative behavioral jumps in self-supervised rl