INQUIRING LINE

What distinguishes surface mechanisms from the training regimes that produce them?

This explores the gap between what a model *appears* to do at the surface — its output formats, behaviors, reasoning moves — and the training dynamics that actually installed those behaviors, and why the two are easy to confuse.


The corpus's recurring lesson is that surface behavior systematically misleads you about its own origin. The cleanest case is instruction tuning: models trained on semantically empty or deliberately *wrong* instructions perform about as well as models trained on correct ones, and the reported score barely clears a random baseline (43% vs. 42.6%). The surface story, 'the model learned to understand the task', is false; what actually transferred was knowledge of the output space, not task comprehension (Does instruction tuning teach task understanding or output format?). The mechanism (right-looking answers) and the regime (learning a format distribution) are different things, and you only catch the difference by perturbing the training, not by reading the outputs.

The same split runs through reinforcement learning. A model that suddenly 'reasons better' after RL looks like it acquired a new skill, but the training regime tells a different story: RLVR (RL with verifiable rewards) mostly acts as a catalyst that surfaces strategies already latent in the pretrained prior, with updates that are structurally sparse and bounded by what pretraining already contained (How does RL training reshape reasoning and what gets lost?). Mechanically, the change is even stranger than it looks: RL concentrates in a sparse but full-rank subnetwork, updating only 5–30% of parameters, and its primary lever is *suppressing* wrong trajectories rather than amplifying right ones (What actually changes inside a model during RL training?). So the surface gain ('new reasoning') is produced by a regime of negative reinforcement on a fixed capability set, a mechanism almost opposite to the intuitive read.
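The "sparse update" half of that claim can be made concrete by comparing a checkpoint before and after RL and counting how many parameters actually moved. The following is a minimal sketch, assuming the weights are available as plain nested lists; the function name `update_fraction`, the tolerance `tol`, and the toy matrices are hypothetical illustration values, not taken from the cited notes.

```python
def update_fraction(before, after, tol=1e-8):
    """Fraction of parameters whose value changed by more than tol."""
    changed = 0
    total = 0
    for row_before, row_after in zip(before, after):
        for w_before, w_after in zip(row_before, row_after):
            total += 1
            if abs(w_after - w_before) > tol:
                changed += 1
    return changed / total


# Toy 4x4 weight matrix (hypothetical values); only 3 of 16 entries are touched.
before = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
after = [row[:] for row in before]
after[0][0] += 0.5
after[1][2] -= 0.25
after[3][3] += 0.125

frac = update_fraction(before, after)
print(f"updated fraction: {frac:.4f}")
```

Checking the "full-rank" half of the claim would additionally need a rank estimate of the update matrix (for example from its singular values), which this sketch leaves out.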

But the distinction isn't a clean 'surface always lies.' The regime determines *when* a surface behavior is genuinely new versus merely activated. Capability creation turns out to be domain-conditional: for standard reasoning, RL activates latent abilities; for complex multi-step planning, the same training generates genuinely novel strategies a base model can't reach even with heavy sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). And the regime has internal structure of its own: RL training moves through a predictable two-phase arc, first consolidating execution correctness, then shifting the bottleneck to strategic planning (Does RL training follow a predictable two-phase learning sequence?). Which surface skill you see depends on which phase, and even which *format*, the regime happened to amplify: RL tends to collapse onto a single dominant pretraining format, often selected by model scale rather than performance (Does RL training collapse format diversity in pretrained models?).
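"Reachable with heavy sampling" is commonly quantified with pass@k: the probability that at least one of k sampled attempts is correct. Below is a minimal sketch using the standard unbiased estimator from n samples of which c are correct. The sample counts are made-up illustration values, not results from the cited notes.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn, c: number of correct samples, k: sampling budget.
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: a base model that is rarely right still reaches the
# answer under a large budget, which is the "latent ability" pattern.
rare_but_present = pass_at_k(n=100, c=2, k=50)

# A strategy that never shows up in the samples stays at zero for any budget,
# which is the pattern the "genuinely novel" claim refers to.
never_sampled = pass_at_k(n=100, c=0, k=50)

print(f"rare but present: {rare_but_present:.4f}")
print(f"never sampled:    {never_sampled:.4f}")
```

Comparing a base model and its RL-trained counterpart across increasing k is one way to separate "activated" from "newly created" behavior: if the base model catches up at large k, the ability was already in its sampling distribution.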

This is exactly why you can't infer the regime from the mechanism by inspection, and why the corpus insists on causal, not just representational, analysis. Identifying a feature that correlates with a behavior tells you a candidate mechanism exists; only intervening on it tells you the behavior actually depends on it, which is why complete mechanistic claims need both representational *and* causal methods (Can we understand LLM mechanisms with only representational analysis?). The deeper warning is that identical benchmark scores can sit on top of radically different internal structures, and pushing one surface capability reliably degrades another out of view (What really happens inside a language model?). Two models with the same score on the test can carry entirely different regimes underneath.
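The gap between correlation and intervention shows up even in a toy setting. The sketch below assumes a hypothetical two-feature model in which feature `b` tracks the input perfectly but is never wired to the output. A representational check (Pearson correlation) flags `b` as relevant, while a causal check (zero-ablation) shows that only `a` matters. All names and values are invented for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def toy_model(x, ablate=None):
    """Two hidden features; only `a` is connected to the output."""
    hidden = {"a": x, "b": 2.0 * x}  # b tracks the input but is unused
    if ablate is not None:
        hidden[ablate] = 0.0  # causal intervention: zero out one feature
    return 3.0 * hidden["a"]


inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
outputs = [toy_model(x) for x in inputs]
feature_b = [2.0 * x for x in inputs]

# Representational view: b looks like a strong predictor of the output.
corr_b = pearson(feature_b, outputs)


def ablation_effect(name):
    """Largest output change caused by zeroing the named feature."""
    return max(abs(toy_model(x, ablate=name) - y)
               for x, y in zip(inputs, outputs))


# Causal view: removing b changes nothing, removing a changes everything.
effect_a = ablation_effect("a")
effect_b = ablation_effect("b")
```

Here the correlation alone would nominate the wrong feature; pairing it with the ablation is what turns a candidate into a mechanistic claim.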

The payoff for a curious reader: the field is converging on 'learning mechanics', a frame borrowed from classical and statistical mechanics that explains surface behavior through *training dynamics* rather than static architecture, treating the trajectory of learning as the real object of study (Can deep learning theory unify around training dynamics?). Even something as basic as training *order* leaves a mechanical fingerprint: train structured tasks before creative ones and you prevent an entropy collapse that would otherwise quietly damage open-ended ability (Does training order reshape how models handle different task types?). The behavior you see at the surface is a shadow cast by the regime, and the whole research program is about learning to read the shadow back to its source.
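"Entropy collapse" here means the model's output distribution becoming sharply peaked. Here is a minimal sketch of the quantity involved, assuming two hypothetical next-token distributions over a four-token vocabulary; the probabilities are illustration values only.

```python
from math import log2


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (zero terms are skipped)."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical distributions: one spread evenly, one dominated by a single token.
diverse = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.97, 0.01, 0.01, 0.01]

h_diverse = shannon_entropy(diverse)
h_collapsed = shannon_entropy(collapsed)

print(f"diverse:   {h_diverse:.3f} bits")
print(f"collapsed: {h_collapsed:.3f} bits")
```

Tracking this value per task type over training is one way to see whether structured tasks are pulling entropy down in a way that spills over into open-ended ones.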


Sources (10 notes)

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

What really happens inside a language model?

Research into mechanistic interpretability, cognitive models, and training dynamics shows that identical benchmark performance conceals radically different internal structures. Improving one capability (helpfulness, accuracy) reliably degrades others (faithfulness, calibration, diversity).

Can deep learning theory unify around training dynamics?

Research shows learning mechanics is consolidating as a unified frame for deep learning, modeled on classical and statistical mechanics. It prioritizes average-case predictions, training dynamics, and aggregate statistics over worst-case bounds, mirroring how physics addresses macroscopic systems.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
