Does the model learn depth-wise drift as an explicit strategy?
This reads the question two ways at once: does a model *deliberately* drift away from its starting point as a learned tactic, or does 'drift' (and the depth at which it happens) emerge as an unplanned side effect of training? The corpus comes down firmly on the second.
The question is whether a model treats drifting away from its base behavior, layer by layer through its depth, as something it learns on purpose, or whether drift is an emergent byproduct of how training pushes on weights. The collection's answer is consistent: drift is mostly an unintended consequence, not a strategy, and where it lands in the network is structured rather than chosen.
Start with what drift costs. Models that stay close to their original distribution keep their ability to learn new things; the further a model drifts in KL terms, the more its plasticity erodes, so low-drift training is something you engineer *for*, not a tactic the model elects (see "Does staying close to the base model preserve learning ability?"). That reframes drift as a side effect to be minimized, not a learned skill. The same theme shows up at decoding time: proxy-tuning deliberately leaves base weights untouched precisely because direct fine-tuning corrupts knowledge stored in the lower layers, while the useful behavioral shift can be applied as a distributional nudge that mostly touches reasoning and style (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So drift has a depth signature, hitting lower layers hardest, but that signature is a vulnerability, not a plan.
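To make "drift in KL terms" and a "distributional nudge" concrete, here is a minimal sketch over a toy four-token vocabulary. It assumes the usual description of proxy-tuning, in which the base model's logits are shifted by the difference between a small tuned model and its untuned counterpart; the notes above do not spell out that formula, so treat it as an assumption. The logit values, the function names, and the choice to measure KL(steered || base) are all illustrative.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def proxy_tuned_logits(base, expert, anti_expert):
    """Assumed proxy-tuning arithmetic: base + (tuned small - untuned small).

    The base model's weights are never modified; only its output
    logits are shifted at decoding time.
    """
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]

# Hypothetical next-token logits over a 4-token vocabulary.
base_logits = [2.0, 1.0, 0.5, -1.0]
expert_logits = [1.0, 2.5, 0.0, -1.0]       # small tuned model
anti_expert_logits = [1.0, 1.0, 0.0, -1.0]  # same small model, untuned

base_probs = softmax(base_logits)
steered_probs = softmax(
    proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
)

# Drift of the steered distribution away from the base distribution.
drift = kl_divergence(steered_probs, base_probs)
print(f"KL(steered || base) = {drift:.4f} nats")
```

In this toy case only the token the tuned model prefers gains probability, and the drift is a single measurable number. A low-drift training recipe would aim to keep that number small, averaged over real prompts.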
Where the question gets interesting is *how non-random* the change is. RL doesn't smear updates everywhere: it consistently touches only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are almost identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?"). That reproducibility looks strategic, but it's better read as structural: the training dynamics keep landing in the same place. Similarly, RL collapses a model's many pretraining formats down to one dominant format within the first epoch, and which format wins depends on model scale rather than on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). The model 'drifts' toward a single mode, but it isn't choosing it for good reasons; the convergence is a property of the optimizer, not a deliberation.
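The two measurements behind the sparsity claim can be sketched on toy data: the fraction of parameters an update actually changes, and how much the changed sets from two seeds overlap. The flat parameter lists, the function names, and the use of Jaccard overlap are assumptions made for illustration; the underlying note does not say which overlap metric was used.

```python
def update_mask(before, after, tol=0.0):
    """Return the set of parameter indices whose value changed by more than tol."""
    return {i for i, (b, a) in enumerate(zip(before, after)) if abs(a - b) > tol}

def update_fraction(before, after, tol=0.0):
    """Fraction of parameters touched by training."""
    return len(update_mask(before, after, tol)) / len(before)

def mask_overlap(mask_a, mask_b):
    """Jaccard overlap between two update masks (1.0 means identical)."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0

# Hypothetical flattened weights before training.
before = [0.1, -0.2, 0.3, 0.4, 0.5, -0.6, 0.7, 0.8, 0.9, 1.0]

# Two hypothetical training runs that differ only in random seed.
after_seed0 = list(before)
after_seed0[1] = -0.25
after_seed0[4] = 0.55

after_seed1 = list(before)
after_seed1[1] = -0.22
after_seed1[4] = 0.58
after_seed1[7] = 0.81

frac0 = update_fraction(before, after_seed0)
frac1 = update_fraction(before, after_seed1)
overlap = mask_overlap(update_mask(before, after_seed0),
                       update_mask(before, after_seed1))
print(frac0, frac1, overlap)
```

A high overlap across seeds is what makes the pattern look structural: independent runs keep landing on the same parameters without any of them choosing to.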
Now the depth half. There's real evidence that *depth itself* does structured work: deep-and-thin sub-billion models beat balanced ones by composing abstract concepts through successive layers (see "Does depth matter more than width for tiny language models?"), and scaling self-supervised RL to a thousand layers produces sharp behavioral jumps at specific depth thresholds (depth 16 to walk, depth 256 to wall-climb), driven by depth unlocking exploration and expressivity, not gradual drift (see "Does network depth unlock qualitatively new behaviors in RL?"). So 'depth-wise' change is real and consequential, but it reads as capability emerging at thresholds, not as a strategy the model narrates to itself.
The honest synthesis: the corpus has no note showing a model that *learns drift as an explicit strategy*. What it has is the opposite picture: drift is a cost you regularize against, its depth profile is a liability, and the genuinely structured, reproducible parts of training (sparse subnetworks, format collapse, threshold jumps) are emergent regularities of the optimizer and architecture. If you want the closest thing to *deliberate* strategic behavior in the collection, it lives not in weight-drift but in the work on reasoning tactics: penalizing premature thought-switching (see "Do reasoning models switch between ideas too frequently?") and forcing breadth-first exploration through learned abstractions (see "Can abstractions guide exploration better than depth alone?"). That is where 'explicit strategy' actually gets engineered, just not as drift.
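As an illustration of what an engineered strategy looks like, here is a rough sketch of a decoding-time penalty on thought-transition tokens, in the spirit of the thought-switching penalty cited above. The token list, the penalty size, the window length, and the function name are hypothetical placeholders, not values taken from the underlying work.

```python
def penalize_switch_tokens(logits, switch_tokens, tokens_into_thought,
                           penalty=3.0, window=600):
    """Lower the logits of thought-switching tokens early in a thought.

    logits: dict mapping token string -> logit for the next position.
    switch_tokens: tokens assumed to signal a change of approach.
    tokens_into_thought: tokens generated since the current thought began.
    penalty, window: hypothetical strength and duration of the penalty.
    Returns a new dict; the input is left unmodified.
    """
    if tokens_into_thought >= window:
        return dict(logits)  # past the window, decode as usual
    return {
        tok: (val - penalty if tok in switch_tokens else val)
        for tok, val in logits.items()
    }

# Hypothetical next-token logits and switch-token set.
logits = {"alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch_tokens = {"alternatively"}

early = penalize_switch_tokens(logits, switch_tokens, tokens_into_thought=5)
late = penalize_switch_tokens(logits, switch_tokens, tokens_into_thought=900)

# Greedy choice under each condition.
early_choice = max(early, key=early.get)
late_choice = max(late, key=late.get)
print(early_choice, late_choice)
```

The strategy here lives entirely in the decoding rule that a person wrote. Nothing about it is learned by the model or stored in its weights, which is the contrast this synthesis draws with drift.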
Sources (8 notes)
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, without any explicit sparsity regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.