How do RL training and base models differ in creating MI peaks?
This reads the question as asking where reinforcement-learning training and the underlying base model diverge in *concentrating* capability: whether RL carves genuinely new reasoning peaks or only amplifies structure already latent in the pretrained model. Worth flagging up front: nothing in the corpus measures 'mutual information peaks' under that label, so this synthesizes the closest territory, namely how RL reshapes the base model's distribution rather than building from scratch.
The strongest claim that RL *deploys* rather than *creates* comes from work arguing that RL post-training teaches a model *when* to reason, not *how*: base models already carry reasoning strategies in latent form, hybrid models recover ~91% of the gains just by routing tokens, and strategy-activation vectors are detectable before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?"). That picture is reinforced mechanically: RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across random seeds, and it appears to work mostly by *suppressing* wrong trajectories rather than amplifying right ones (see "What actually changes inside a model during RL training?" and "Does reinforcement learning update only a small fraction of parameters?"). In this telling, RL doesn't build a new peak; it sharpens one base-model mode and collapses the alternatives. The format-convergence work shows exactly this: RL locks onto a single dominant pretraining format within the first epoch and quietly buries the others (see "Does RL training collapse format diversity in pretrained models?").
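To make the sparsity claim above concrete, here is a minimal sketch of how one might measure the fraction of parameters an RL run touched and how much two seeds agree on which parameters they touched. Everything in it is an illustrative assumption: the helper names (`updated_indices`, `update_fraction`, `jaccard`), the tolerance, and the ten-value toy "checkpoints" are hypothetical, and real checkpoints would be flattened weight tensors. It measures sparsity and cross-seed overlap only; it says nothing about the rank of the update.

```python
def updated_indices(base, tuned, tol=1e-8):
    """Indices of parameters that moved by more than `tol` during fine-tuning."""
    if len(base) != len(tuned):
        raise ValueError("checkpoints must have the same number of parameters")
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}


def update_fraction(base, tuned, tol=1e-8):
    """Fraction of all parameters that fine-tuning changed."""
    if not base:
        raise ValueError("checkpoint is empty")
    return len(updated_indices(base, tuned, tol)) / len(base)


def jaccard(a, b):
    """Overlap between two sets of updated indices (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data (hypothetical): ten "parameters" and two RL runs with different seeds.
base = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
seed_1 = list(base)
seed_1[2] += 0.05
seed_1[7] -= 0.03
seed_2 = list(base)
seed_2[2] += 0.04
seed_2[7] -= 0.02

frac = update_fraction(base, seed_1)
overlap = jaccard(updated_indices(base, seed_1), updated_indices(base, seed_2))
print(f"updated fraction: {frac:.0%}, cross-seed overlap: {overlap:.2f}")
```

In this toy setup both seeds change the same two of ten parameters, which is the pattern the cited notes describe at scale: a small updated fraction with high agreement across seeds.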
The opposing camp has direct counterevidence: prolonged RL with KL control, policy resetting, and non-mathematical tasks beats the base model at *every* pass@k level, not just at low k. Sharpening alone would let the base model catch up as k grows, so a gap that persists across k is the signature of reasoning strategies the base model does not reach by resampling (see "Can reinforcement learning discover reasoning strategies base models cannot?"). So the difference between 'RL deploys' and 'RL creates' may itself depend on the recipe and the domain: where base models already have established patterns, as in math, RL looks like redeployment, while in domains where the base lacks such patterns, RL appears to build new peaks.
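The pass@k argument above can be sketched numerically. The estimator below is the standard unbiased pass@k formula, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two-problem benchmark and all the counts are made-up illustrations, not results from the cited notes. The "sharpened" model concentrates on a problem the base model could already solve, while the "expanded" model also solves the problem the base model rarely gets.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100                      # samples drawn per problem (hypothetical)
base_counts = [10, 2]        # base model: weak on both, but nonzero on both
sharpened_counts = [90, 0]   # RL that only sharpens an existing mode
expanded_counts = [90, 20]   # RL that also reaches the hard problem

for k in (1, 8, 64):
    row = [mean_pass_at_k(counts, N, k)
           for counts in (base_counts, sharpened_counts, expanded_counts)]
    print(k, [round(v, 3) for v in row])
```

In this toy data the sharpened model wins at k=1 but loses to the base model at k=64, whereas the expanded model stays ahead at every k shown. That crossover, or its absence, is what the pass@k comparison is meant to detect.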
These two stories can both be true because *where* RL concentrates capability is fragile and recipe-dependent. Drift away from the base distribution destroys plasticity: FST-trained models stay up to 70% closer to their base distribution than parameter-only RL and keep the ability to learn subsequent tasks, while parameter-only updates stall when the domain shifts (see "Does staying close to the base model preserve learning ability?"). Push RL on nearly impossible problems and it doesn't form a new peak at all; it amplifies degenerate shortcuts that then contaminate capabilities the base model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). Even calibration, meaning how the model's confidence concentrates, gets distorted: binary correctness rewards never penalize confident wrong answers, so they push the model toward high-confidence guessing unless a proper scoring term such as the Brier score is added to the reward (see "Does binary reward training hurt model calibration?").
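The calibration point can be illustrated with a small sketch. It assumes one simple way of combining the two terms, reward = correctness minus the squared gap between stated confidence and correctness; the exact combination used in the cited work is not given here, and the function names are hypothetical. Because the Brier score is a proper scoring rule, the expected value of this combined reward is maximized when the stated confidence equals the true probability of being correct, which the grid search below checks.

```python
def binary_reward(correct):
    """Plain correctness reward: ignores how confident the model claimed to be."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct, confidence):
    """Correctness minus a Brier penalty on the stated confidence (assumed form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_calibrated_reward(p_correct, confidence):
    """Expected combined reward when the answer is right with prob. p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct, steps=100):
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated_reward(p_correct, q))


# A wrong answer costs nothing extra under the binary reward, whatever the
# confidence, but the Brier term punishes confident errors much harder.
print(binary_reward(False))
print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))
print(best_confidence(0.3))
```

Under the binary reward a wrong answer scores the same whether it was stated at 90% or 10% confidence, so nothing discourages overclaiming. With the Brier term, the 90% wrong answer is penalized far more, and the optimal stated confidence tracks the true success probability.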
The thing you might not have known you wanted: the divergence between RL and base models isn't one effect but a *scheduling* phenomenon. Training order mechanically reshapes entropy: structured tasks shrink output entropy while creative ones expand it, so doing structured work first protects open-ended capability from collapse (see "Does training order reshape how models handle different task types?"). RL's relationship to the base model, in other words, isn't fixed by the algorithm; it's set by what you train, in what order, and how far you let the model drift. The trajectory as a whole is predictable enough to follow sigmoid curves whose ceiling the recipe decides in advance (see "Does RL training follow predictable scaling curves?").
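As a rough picture of what "sigmoid curves whose ceiling the recipe decides" could mean, the sketch below uses one saturating functional form in training compute. The specific formula, the parameter names (`ceiling`, `midpoint`, `slope`, `floor`), and all numbers are assumptions chosen for illustration; the cited study's actual parameterization is not given in the notes. The recipe is modeled as fixing `ceiling`, and implementation efficiency as shifting `midpoint`.

```python
def sigmoid_performance(compute, ceiling, midpoint, slope=1.0, floor=0.0):
    """Saturating curve in compute: starts near `floor`, approaches `ceiling`.

    `midpoint` is the compute at which half of the (ceiling - floor) gain
    has been realized. This is an assumed, illustrative functional form.
    """
    if compute <= 0 or midpoint <= 0:
        raise ValueError("compute and midpoint must be positive")
    return floor + (ceiling - floor) / (1.0 + (midpoint / compute) ** slope)


# Hypothetical recipes; compute is in arbitrary units.
recipe_a = dict(ceiling=0.60, midpoint=1_000.0)
recipe_a_fast = dict(ceiling=0.60, midpoint=250.0)  # same recipe, more efficient
recipe_b = dict(ceiling=0.75, midpoint=1_000.0)     # different recipe, higher cap

for compute in (100.0, 1_000.0, 10_000.0, 1_000_000.0):
    print(compute,
          round(sigmoid_performance(compute, **recipe_a), 3),
          round(sigmoid_performance(compute, **recipe_a_fast), 3),
          round(sigmoid_performance(compute, **recipe_b), 3))
```

In this sketch the more efficient variant reaches its plateau sooner but never exceeds the plateau of its own recipe, while the recipe with the higher ceiling eventually overtakes both. That mirrors the claim that recipe choices set the ceiling and implementation details only affect efficiency.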
Sources (10 notes)
Does RL post-training create reasoning or just deploy it?: Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
What actually changes inside a model during RL training?: RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories, rather than amplifying correct ones, appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Does reinforcement learning update only a small fraction of parameters?: Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Does RL training collapse format diversity in pretrained models?: Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Can reinforcement learning discover reasoning strategies base models cannot?: RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Does staying close to the base model preserve learning ability?: FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Do overly hard RLVR samples actually harm model capabilities?: Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Does binary reward training hurt model calibration?: Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Does training order reshape how models handle different task types?: Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling (training structured tasks first) yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Does RL training follow predictable scaling curves?: Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.