INQUIRING LINE

How do RL training and base models differ in creating MI peaks?

This reads the question as asking where reinforcement-learning (RL) training and the underlying base model actually diverge in *concentrating* capability: whether RL carves genuinely new reasoning peaks or just amplifies structure already latent in the pretrained model. Worth flagging up front: nothing in the corpus measures 'mutual information (MI) peaks' under that label, so this synthesizes the closest territory, namely how RL reshapes the base model's distribution rather than building from scratch.

The strongest claim that RL *deploys* rather than *creates* comes from work arguing that RL post-training teaches a model *when* to reason, not *how*. Base models already carry reasoning strategies in latent form, hybrid models recover ~91% of the gains just by routing tokens, and strategy-activation vectors are detectable before any RL touches the weights (Does RL post-training create reasoning or just deploy it?). That picture is reinforced mechanically: RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across random seeds, and it works mostly by *suppressing* wrong trajectories rather than amplifying right ones (What actually changes inside a model during RL training?; Does reinforcement learning update only a small fraction of parameters?). In this telling, RL doesn't build a new peak. It sharpens one base-model mode and collapses the alternatives, which is exactly what the format-convergence work shows: RL locks onto a single dominant pretraining format within the first epoch and quietly buries the others (Does RL training collapse format diversity in pretrained models?).
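
To make the 5–30% figure concrete, one can compare a base checkpoint with its RL-tuned counterpart and count how many weights moved. The sketch below does this on toy lists of floats. The function name `update_fraction`, the tolerance, and the toy weights are illustrative assumptions, not the measurement procedure used in the cited notes.

```python
def update_fraction(base, tuned, tol=1e-8):
    """Fraction of parameters whose value changed by more than `tol`."""
    if len(base) != len(tuned):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, t in zip(base, tuned) if abs(b - t) > tol)
    return changed / len(base)

# Toy "checkpoints" (hypothetical): 20 weights, 3 of which RL nudged.
base_weights = [0.1 * i for i in range(20)]
tuned_weights = list(base_weights)
for idx in (2, 7, 15):
    tuned_weights[idx] += 0.05

frac = update_fraction(base_weights, tuned_weights)
print(f"updated fraction: {frac:.2f}")
```

On real checkpoints the same comparison would run per tensor. Checking whether the changed positions coincide across random seeds would be a separate overlap computation that this sketch does not attempt.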

But the opposing camp has direct counterevidence: prolonged RL with KL control, policy resetting, and non-mathematical tasks beats the base model at *every* pass@k level, not just at small k, which is taken as the signature of genuinely new reasoning strategies that the base model does not reach through resampling (Can reinforcement learning discover reasoning strategies base models cannot?). So the difference between 'RL deploys' and 'RL creates' may itself depend on the recipe and the domain: base models have established patterns in math, so RL there looks like redeployment, while in domains where the base model lacks such patterns, RL appears to build new peaks.
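
The pass@k argument is easier to follow with numbers. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), on made-up per-problem counts. The three "models" and their counts are hypothetical and chosen only to contrast a sharpening pattern (wins at small k, loses at large k) with an expanding one (wins at every k).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k draws (without
    replacement) from n samples, c of them correct, is correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100               # samples drawn per problem (hypothetical)
base = [20, 2]        # base model: occasionally right on both problems
sharpened = [90, 0]   # RL that sharpens problem 1 but loses problem 2
expanded = [90, 10]   # RL that also gains ground on problem 2

for k in (1, 8, 64):
    print(k,
          round(mean_pass_at_k(base, N, k), 3),
          round(mean_pass_at_k(sharpened, N, k), 3),
          round(mean_pass_at_k(expanded, N, k), 3))
```

In this toy setup the sharpened model leads at k = 1 but falls behind the base model at k = 64, while the expanded model stays ahead at each k shown. The second pattern is the one the counterevidence reports.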

The reason these two stories can both be true is that *where* RL concentrates capability is fragile and recipe-dependent. Drift away from the base distribution destroys plasticity: FST-trained models stay up to 70% closer to the base distribution than parameter-only RL, and that reduced drift preserves the ability to keep learning, while parameter-only updates stall when the domain shifts (Does staying close to the base model preserve learning ability?). Push RL on nearly impossible problems and it doesn't form a new peak at all. It amplifies degenerate shortcuts that then contaminate capabilities the base model already had (Do overly hard RLVR samples actually harm model capabilities?). Even calibration, meaning how the model's confidence concentrates, gets distorted: binary correctness rewards push confident guessing because they don't penalize confident wrong answers, so calibrated uncertainty isn't preserved unless you add a proper scoring term such as the Brier score (Does binary reward training hurt model calibration?).
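
The calibration point can be checked with a few lines of arithmetic. The sketch below assumes a reward of the form `correct - w * (confidence - correct)**2`, which is one plausible way to attach a Brier-style scoring term to a binary reward. The exact combination used in the cited work is not stated in these notes, and all names here are illustrative.

```python
def expected_reward(p, q, brier_weight):
    """Expected reward when the answer is right with probability p and the
    model reports confidence q. Assumed form: correct - w * (q - correct)**2."""
    reward_if_right = 1.0 - brier_weight * (q - 1.0) ** 2
    reward_if_wrong = 0.0 - brier_weight * (q - 0.0) ** 2
    return p * reward_if_right + (1.0 - p) * reward_if_wrong

def best_confidence(p, brier_weight, steps=100):
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p, q, brier_weight))

p_true = 0.3  # hypothetical chance that the answer is actually correct

# With no scoring term (w = 0), claiming certainty costs nothing.
binary_spread = (expected_reward(p_true, 1.0, 0.0)
                 - expected_reward(p_true, 0.0, 0.0))

# With the scoring term, the best stated confidence tracks p_true.
calibrated_q = best_confidence(p_true, brier_weight=1.0)
print(binary_spread, calibrated_q)
```

Under the pure binary reward the expected payoff is the same at any stated confidence, so nothing discourages overclaiming. With the scoring term the maximum sits at the true probability, which is the defining property of a proper scoring rule.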

A less obvious takeaway: the divergence between RL and base models isn't one effect but a *scheduling* phenomenon. Training order mechanically reshapes entropy. Structured tasks shrink output entropy while creative ones expand it, so doing structured work first protects open-ended capability from collapse (Does training order reshape how models handle different task types?). RL's relationship to the base model, in other words, isn't fixed by the algorithm; it's set by what you train, in what order, and how far you let the model drift. The overall trajectory is also predictable enough to follow sigmoid curves whose ceiling is set by the recipe (Does RL training follow predictable scaling curves?).
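
The "ceiling versus efficiency" distinction in the scaling claim can be pictured with a toy saturating curve. The functional form below (a logistic in log-compute), the parameter names, and all numbers are assumptions made for illustration. The notes say only that performance scales sigmoidally, with the recipe setting the ceiling and implementation details affecting efficiency.

```python
def rl_performance(compute, ceiling, c_mid, slope=1.0):
    """Saturating sigmoid in log-compute (assumed form): rises from 0 toward
    `ceiling`, reaching half of it when compute == c_mid."""
    return ceiling / (1.0 + (c_mid / compute) ** slope)

# Hypothetical recipes: (ceiling, compute at half-ceiling).
recipe_a_fast = (0.60, 1e3)
recipe_a_slow = (0.60, 1e4)   # same ceiling, less efficient implementation
recipe_b = (0.45, 1e3)        # different recipe, lower ceiling

for compute in (1e3, 1e5, 1e8):
    row = [round(rl_performance(compute, *r), 3)
           for r in (recipe_a_fast, recipe_a_slow, recipe_b)]
    print(f"{compute:.0e}", row)
```

In this toy, the two variants of recipe A differ early but converge on the same plateau, while recipe B plateaus lower no matter how much compute is added. That is the sense in which a fit on small runs could be extrapolated once the recipe is fixed.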


Sources (10 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does RL training follow predictable scaling curves?

Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, probe whether RL and base models create or deploy reasoning peaks differently — treating recent claims as dated and testable, not settled fact.

What a curated library found — and when (dated claims, not current truth):
Findings span Sept 2024–May 2026. A curated library identified:
• RL post-training acts as *deployment* rather than *creation*: base models carry latent reasoning; RL teaches *when* to use it, recovering ~91% of gains via token routing (2025-04, arXiv:2504.07912).
• RL updates only 5–30% of parameters in sparse, reproducible subnetworks; gains come mostly from *suppressing* wrong trajectories, not amplifying right ones (2025-05, arXiv:2505.11711).
• Opposing evidence: prolonged RL with KL control beats base at every pass@k level, suggesting genuinely novel reasoning inaccessible to base-model resampling (2025-05, arXiv:2505.24864).
• Constraint: FST-trained models stay up to 70% closer to the base distribution than parameter-only RL, and that lower drift preserves plasticity; parameter-only approaches stall when task domains change (2026-05, arXiv:2605.12484).
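• Training order reshapes entropy mechanically: structured tasks shrink output entropy while creative ones expand it, so sequencing matters (2025-07, arXiv:2507.14783).
• Separately, a large-scale scaling study (400K GPU-hours, 200+ models) finds RL performance follows sigmoid curves whose ceiling is set by the recipe.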

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (2025-04) — Echo Chamber thesis (deployment framing)
• arXiv:2505.24864 (2025-05) — ProRL (creation counterevidence)
• arXiv:2507.14783 (2025-07) — Omni-Thinker (entropy dynamics & task scheduling)
• arXiv:2605.12484 (2026-05) — Learning Fast and Slow (plasticity & KL bounds)

Your task:
(1) RE-TEST THE DEPLOYMENT VS. CREATION DIVIDE. Does the "~91% routing recovery" claim hold under newer scaling (2026–present model sizes)? Has the parameter-sparsity finding (5–30%) been challenged by full-model or adapter methods? For KL-drift constraints, check whether recent methods (e.g., expert mixing, soft LoRA, continued pretrain-style RL) preserve plasticity without staying as close to the base distribution as the "up to 70% closer" result suggests. Separate the durable question (when does RL build vs. activate?) from perishable constraints (maybe newer training recipes dissolve the deployment framing entirely).
(2) Surface work from the last 6 months that contradicts or supersedes the "deployment" consensus — especially if newer models show RL *does* create novel peaks, or if mechanistic work revises the parameter-sparsity picture.
(3) Propose two research questions assuming the regime may have moved: (a) Does the deployment/creation boundary shift with model scale or pretraining diversity? (b) Can we design RL recipes that stay within plasticity bounds *while* inducing genuine peak-creation instead of redeployment?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
