INQUIRING LINE

When an AI model drifts from how it was trained, is that a deliberate tactic — or just an accident of learning?

Does the model learn depth-wise drift as an explicit strategy?

This reads the question two ways at once — does a model *deliberately* drift away from its starting point as a learned tactic, or does 'drift' (and the depth at which it happens) emerge as an unplanned side effect of training? — and the corpus comes down firmly on the second.

This explores whether a model treats drifting away from its base behavior — and doing so layer-by-layer through its depth — as something it learns on purpose, or whether drift is an emergent byproduct of how training pushes on weights. The collection's answer is consistent: drift is mostly an unintended consequence, not a strategy, and where it lands in the network is structured rather than chosen.

Start with what drift costs. Models that stay close to their original distribution keep their ability to learn new things; the further a model drifts in KL terms, the more its plasticity erodes, so low-drift training is something you engineer *for*, not a tactic the model elects (see "Does staying close to the base model preserve learning ability?"). That reframes drift as a side effect to be minimized, not a learned skill. The same theme shows up at decoding time: proxy-tuning deliberately leaves base weights untouched because direct fine-tuning corrupts knowledge stored in the lower layers, while the useful behavioral shift can be applied as a distributional nudge that mostly touches reasoning and style (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So drift has a depth signature, hitting lower layers hardest, but that signature is a vulnerability, not a plan.
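To make "drift in KL terms" and "distributional nudge" concrete, here is a minimal sketch on a toy four-token vocabulary. It assumes the commonly described proxy-tuning recipe, in which the large base model's logits are offset by the difference between a small tuned model and its small untuned counterpart; the function names, the toy logits, and the choice of KL direction (shifted relative to base) are illustrative assumptions, not taken from the notes.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def proxy_tuned_logits(base_large, tuned_small, base_small):
    """Offset the large model's logits by what tuning changed in the small model.

    No weights of the large model are modified; the shift is applied
    to its output distribution only.
    """
    return [b + (t - u) for b, t, u in zip(base_large, tuned_small, base_small)]


# Toy next-token logits (hypothetical values for illustration only).
base_large = [2.0, 1.0, 0.1, -1.0]
base_small = [1.5, 1.2, 0.0, -0.5]
tuned_small = [1.5, 2.2, 0.0, -0.5]  # tuning raised token 1 only

base_dist = softmax(base_large)
shifted_dist = softmax(proxy_tuned_logits(base_large, tuned_small, base_small))

# Drift of the nudged distribution away from the base distribution.
drift = kl_divergence(shifted_dist, base_dist)
print(f"KL drift from base: {drift:.4f} nats")
```

The same `kl_divergence` helper could be averaged over many contexts to track how far a fine-tuned checkpoint has moved from its base, which is the quantity the low-drift argument above is about.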

Where the question gets interesting is *how non-random* the change is. RL doesn't smear updates everywhere: it consistently touches only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are almost identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?"). That reproducibility looks strategic, but it's better read as structural: the training dynamics keep landing in the same place. Similarly, RL collapses a model's many pretraining formats down to one dominant format within the first epoch, and which format wins depends on model scale rather than on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). The model 'drifts' toward a single mode, but it isn't choosing it for good reasons; the convergence is a property of the optimizer, not a deliberation.
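The two measurements behind that claim, how much of the network moved and whether different seeds moved the same part, can be sketched as follows. This is an illustrative toy on flat parameter lists, not the method used in the cited work; the helper names, the tolerance parameter, and the choice of Jaccard overlap as the similarity measure are assumptions.

```python
def changed_indices(before, after, tol=0.0):
    """Indices of parameters whose value moved by more than tol."""
    return {i for i, (b, a) in enumerate(zip(before, after)) if abs(a - b) > tol}


def updated_fraction(before, after, tol=0.0):
    """Share of parameters that training actually changed."""
    return len(changed_indices(before, after, tol)) / len(before)


def jaccard_overlap(set_a, set_b):
    """Overlap of two updated-parameter sets (1.0 means identical)."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Toy "checkpoint" of ten parameters (hypothetical values).
before = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.75, 0.1, 0.9, -0.3]

# Two hypothetical runs with different seeds.
after_seed_a = list(before)
after_seed_a[2] += 0.05
after_seed_a[7] -= 0.02

after_seed_b = list(before)
after_seed_b[2] += 0.04
after_seed_b[7] -= 0.03
after_seed_b[9] += 0.01

frac_a = updated_fraction(before, after_seed_a)
frac_b = updated_fraction(before, after_seed_b)
overlap = jaccard_overlap(
    changed_indices(before, after_seed_a),
    changed_indices(before, after_seed_b),
)
print(frac_a, frac_b, overlap)
```

A high overlap across seeds is what distinguishes a structural regularity from an arbitrary one: random sparse updates of the same size would overlap far less.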

Now the depth half. There's real evidence that *depth itself* does structured work: deep-and-thin sub-billion models beat balanced ones by composing abstract concepts through successive layers (see "Does depth matter more than width for tiny language models?"), and scaling self-supervised RL to a thousand layers produces sharp behavioral jumps at specific depth thresholds (depth 16 to walk, depth 256 to wall-climb), driven by depth unlocking exploration and expressivity, not gradual drift (see "Does network depth unlock qualitatively new behaviors in RL?"). So 'depth-wise' change is real and consequential, but it reads as capability emerging at thresholds, not as a strategy the model narrates to itself.

The honest synthesis: the corpus has no note showing a model that *learns drift as an explicit strategy*. What it has is the opposite picture: drift is a cost you regularize against, its depth profile is a liability, and the genuinely structured, reproducible parts of training (sparse subnetworks, format collapse, threshold jumps) are emergent regularities of the optimizer and architecture. If you want the closest thing to *deliberate* strategic behavior in the collection, it lives not in weight-drift but in the work on reasoning tactics: penalizing premature thought-switching (see "Do reasoning models switch between ideas too frequently?") and forcing breadth-first exploration through learned abstractions (see "Can abstractions guide exploration better than depth alone?"). That is where 'explicit strategy' actually gets engineered, just not as drift.
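As an illustration of what an engineered, decoding-only strategy looks like, the sketch below down-weights thought-switching tokens while the current line of reasoning is still young. It is a toy in the spirit of the thought-switching penalty mentioned above, not that work's implementation; the `penalty` and `window` parameters, their default values, and the token ids are hypothetical.

```python
def penalize_switch_tokens(logits, switch_token_ids, steps_in_thought,
                           penalty=3.0, window=600):
    """Return a copy of logits with switch tokens penalized early in a thought.

    logits            -- next-token scores, indexed by token id
    switch_token_ids  -- ids of tokens that start a new line of thought
    steps_in_thought  -- tokens generated since the current thought began
    penalty, window   -- hypothetical strength and duration of the penalty
    """
    if steps_in_thought >= window:
        return list(logits)  # the thought has had its chance; no penalty
    return [
        score - penalty if token_id in switch_token_ids else score
        for token_id, score in enumerate(logits)
    ]


def greedy_pick(logits):
    """Id of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of four tokens; id 2 stands for a switch word
# such as "alternatively" (an illustrative assumption).
logits = [1.0, 0.5, 1.2, 0.1]
switch_ids = {2}

early = penalize_switch_tokens(logits, switch_ids, steps_in_thought=10)
late = penalize_switch_tokens(logits, switch_ids, steps_in_thought=600)

print(greedy_pick(logits), greedy_pick(early), greedy_pick(late))
```

Nothing in the model's weights changes here, which is the point of the contrast: the strategy is imposed at decoding time by the engineer rather than learned as drift.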

Sources 8 notes

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, with no explicit sparsity regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about whether LLMs learn depth-wise drift as an explicit strategy. The question remains open: does a model *choose* to drift layer-by-layer as a learned tactic, or is drift an emergent byproduct of optimizer dynamics and architecture?

What a curated library found — and when (dated claims, not current truth):
Library findings span 2023–2026. Key constraints on "explicit strategy":
• Drift erodes plasticity (KL-distance cost); it is engineered *against*, not learned as a tactic (2024).
• RL updates only 5–30% of parameters, in sparse subnetworks that are reproducible across seeds: structured, but a property of the optimizer, not deliberation (2025-05, arXiv:2505.11711).
• RL post-training converges on a single dominant pretraining format within epoch 1, driven by model scale, not performance (2025-04, arXiv:2504.07912).
• Depth unlocks qualitative behavioral jumps at specific thresholds (depth 16, 256); emergence, not gradual strategic drift (2025-03, arXiv:2503.14858).
• Explicit strategic behavior (penalizing premature reasoning transitions, breadth-first exploration) is engineered in *reasoning tactics*, not weight drift (2025-01, arXiv:2501.18585).

Anchor papers (verify; mind their dates):
• arXiv:2505.11711 (2025-05): RL finetunes small subnetworks — sparse but full-rank.
• arXiv:2504.07912 (2025-04): Echo Chamber — RL amplifies pretraining, format collapse in epoch 1.
• arXiv:2503.14858 (2025-03): 1000-layer self-supervised RL — depth thresholds trigger jumps.
• arXiv:2501.18585 (2025-01): Underthinking o1-like models — reasoning transitions penalized explicitly.

Your task:
(1) RE-TEST: For each constraint (plasticity cost, sparse subnetworks, format collapse, depth thresholds), has newer training (e.g., constitutional AI, mixture-of-experts routing), longer post-training horizons, or stronger mechanistic probes *overturned* whether drift is unintentional? Separate the durable question (is drift learned?) from perishable limitations (optimizer happens to produce sparse updates). Cite what relaxed each constraint; flag where it still holds.
(2) Surface contradicting or superseding work from the last ~6 months: any papers showing models *do* learn depth-wise drift as a deliberate tactic, or evidence of reward-driven layer-selective strategy?
(3) Propose 2 research questions assuming the training regime has shifted: (a) Do scaling post-training horizons or multi-objective reward signals reveal *learned* depth-targeting? (b) Can mechanistic attribution show whether sparse subnetwork choice is path-dependent strategy or pure structural inevitability?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
