INQUIRING LINE

Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?

This explores whether RL's reach is bounded mainly by what the pretrained base model already contains — its latent skills and distribution — rather than by which optimization algorithm you bolt on top.


The corpus leans hard toward the prior as the binding constraint. The clearest statement is that RL post-training teaches a model *when* to reason, not *how*: base models already carry reasoning strategies in latent form, and RL mostly optimizes deployment timing. Hybrid setups recover ~91% of the gains by routing tokens alone, and the activation patterns for reasoning exist before any RL touches the weights (Does RL post-training create reasoning or just deploy it?). If the capability is already sitting in the prior, then the algorithm is steering, not generating.

The mechanics of what RL actually changes reinforce this. RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are almost identical across random seeds, which points to structural rather than arbitrary selection (Does reinforcement learning update only a small fraction of parameters?). And the dominant mechanism looks like *suppression*: RL works largely by negative reinforcement, damping wrong trajectories rather than installing new ones (What actually changes inside a model during RL training?). A small, suppression-driven edit is a poor candidate for expanding a search space the base model couldn't already enter.
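
To make "sparse but nearly full-rank" concrete, here is a minimal sketch of how one might measure both properties on a single weight matrix. Everything in it is illustrative: `update_stats`, the 64x64 toy matrices, the tolerance, and the synthetic 20% update mask are assumptions made for this example, not the procedure used in the cited note.

```python
import numpy as np

def update_stats(w_base, w_rl, tol=1e-8):
    """Return (fraction of entries changed, matrix rank of the update)."""
    delta = w_rl - w_base
    changed = np.abs(delta) > tol  # entries the update actually moved
    return changed.mean(), np.linalg.matrix_rank(delta)

rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 64))

# Hypothetical RL update: perturb roughly 20% of entries at scattered positions.
mask = rng.random((64, 64)) < 0.2
w_rl = w_base + mask * rng.normal(scale=0.01, size=(64, 64))

frac, rank = update_stats(w_base, w_rl)
print(f"fraction updated: {frac:.2f}, rank of update: {rank} of 64")
```

The point of the toy is that sparsity and rank are independent measurements: an update can leave most entries untouched and still span almost every direction of the matrix, which is different from a low-rank (LoRA-style) update that touches every entry.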

The sharpest evidence that RL narrows rather than widens search comes from diversity studies. RL training collapses behavioral diversity: search agents converge on narrow reward-maximizing strategies through the same entropy-collapse mechanism seen in reasoning, while SFT on diverse demonstrations preserves exploration breadth (Does reinforcement learning squeeze exploration diversity in search agents?). RL also converges on a single dominant *pretraining* format within the first epoch, amplifying one distribution the base model already prefers and suppressing the rest, and which format wins depends on model scale, not necessarily performance (Does RL training collapse format diversity in pretrained models?). The prior literally picks the lane. Out-of-distribution probes drive the point home: even GRPO-trained models drop sharply on N-1 variants, suggesting RL sharpens template-matching against the prior rather than installing transferable procedures (Do fine-tuned language models actually learn optimization procedures?).
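
One simple way to put a number on this kind of collapse is the Shannon entropy of the strategies a policy uses across rollouts. The sketch below is only an illustration: the strategy labels and the before/after counts are made up, and `strategy_entropy` is a hypothetical helper, not a metric taken from the cited notes.

```python
import math
from collections import Counter

def strategy_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution over labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Made-up rollouts: an even mix of four strategies before RL,
# then near-total convergence on one strategy afterwards.
before = ["decompose", "lookup", "verify", "backtrack"] * 25
after = ["lookup"] * 97 + ["decompose", "verify", "backtrack"]

h_before = strategy_entropy(before)
h_after = strategy_entropy(after)
print(f"entropy before: {h_before:.2f} bits, after: {h_after:.2f} bits")
```

An even mix of four strategies sits at the 2-bit maximum, while the converged policy falls well below half a bit even though every strategy still appears at least once.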

But here's the twist worth sitting with: the algorithm isn't innocent. Its failures are about *corrupting* the prior, not failing to exceed it. Overly hard RLVR samples push models into degenerate shortcuts that contaminate pre-existing capabilities, because group-relative normalization treats rare lucky successes as high-advantage signal (Do overly hard RLVR samples actually harm model capabilities?). Binary rewards provably wreck calibration by rewarding confident guessing (Does binary reward training hurt model calibration?). So the optimizer can make a model *worse* than its prior even as it struggles to make it better, an asymmetry that quietly confirms the prior is the ceiling.
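
The group-relative effect is easy to see with a small worked example. The sketch assumes the common GRPO-style normalization, where each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The group size of 8, the `eps` stabilizer, and the reward vectors are assumptions for illustration; real implementations differ in detail.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Nearly impossible prompt: one accidental success out of eight rollouts.
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]
# Moderately easy prompt: six successes out of eight rollouts.
easy_group = [1, 1, 1, 0, 1, 1, 0, 1]

adv_hard = group_relative_advantages(hard_group)
adv_easy = group_relative_advantages(easy_group)

lucky = adv_hard[-1]      # advantage assigned to the single lucky success
typical = adv_easy.max()  # advantage of a success on the easy prompt
print(f"lucky success: {lucky:.2f}, typical success: {typical:.2f}")
```

With binary rewards, a lone success in a group of G rollouts gets an advantage of roughly sqrt(G - 1), so the rarer the success, the harder the update pushes toward whatever that trajectory happened to do, sound reasoning or not.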

The most interesting dissent is search-as-exploration. MCTS-based self-improvement uses tree search to surface and rank solution paths the model wouldn't reliably reach greedily, generating dense process-level signal without human labels (Can tree search replace human feedback in LLM training?), and curriculum and entropy-scheduling tricks can deliberately keep open-ended capability alive instead of letting it collapse (Does training order reshape how models handle different task types?). The unstated lesson across the corpus: if you want RL to *search* rather than merely *sharpen*, the lever is preserving and reorganizing the prior's diversity, not swapping in a cleverer optimizer.


Sources (10 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
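
A short numerical check shows why a Brier term changes the incentive. The sketch assumes one plausible form of the combined reward, correctness minus the squared gap between stated confidence and correctness. That form, the function name, and the example probability are assumptions for illustration, not a quotation of the note's method.

```python
import numpy as np

def expected_reward(p, q, use_brier):
    """Expected reward when the answer is correct with probability p
    and the model states confidence q."""
    binary = p  # expected correctness reward; does not depend on q
    if not use_brier:
        return binary
    # Expected Brier penalty: (q - 1)^2 if correct, (q - 0)^2 if wrong.
    brier = p * (1 - q) ** 2 + (1 - p) * q ** 2
    return binary - brier

p = 0.3                       # true chance the answer is right
qs = np.linspace(0, 1, 101)   # candidate stated confidences

binary_only = np.array([expected_reward(p, q, False) for q in qs])
with_brier = np.array([expected_reward(p, q, True) for q in qs])

best_q = qs[np.argmax(with_brier)]
print(f"binary reward range over q: {np.ptp(binary_only):.3f}")
print(f"confidence maximizing the combined reward: {best_q:.2f}")
```

Under the binary reward alone, expected reward is flat in the stated confidence, so nothing discourages overclaiming. With the Brier term, expected reward peaks when stated confidence matches the true success probability, which is the standard proper-scoring-rule property.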

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether pretrained model capacity or RL algorithm design is the primary bottleneck on search capability. A curated library spanning 2024–2026 has made dated claims about this tension—your job is to test whether newer models, methods, or evaluations have since shifted the ground.

What a curated library found—and when (claims from 2024–2026, treat as perishable):
• RL teaches *when* to reason, not *how*; base priors already contain reasoning strategies; hybrid token-routing recovers ~91% of gains without weight updates (~2025).
• RL updates only 5–30% of parameters in sparse, full-rank subnetworks; mechanism is mostly suppression (damping wrong trajectories) rather than capability installation (~2025).
• RL collapses behavioral diversity and converges on a single dominant pretraining distribution format within one epoch; SFT on diverse demos preserves exploration breadth (~2025).
• MCTS-based self-improvement can surface solution paths via tree search without human labels, and entropy-scheduling can preserve open-ended capability instead of convergence (~2024–2025).
• Hard RLVR samples induce degenerate shortcuts; binary rewards provably degrade calibration; RL can corrupt the prior even as it fails to exceed it (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (2025-05) — Echo Chamber: RL post-training amplifies pretraining behaviors
• arXiv:2505.11711 (2025-05) — RL finetunes sparse subnetworks
• arXiv:2404.12253 (2024-04) — MCTS-based self-improvement without annotations
• arXiv:2510.13786 (2026-05) — Sample difficulty in RLVR

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, ask: have newer models (o1, o3, etc.), better reward designs (DPO, IPO variants), or orchestration techniques (multi-agent looping, tool integration) since relaxed or overturned the claim that the prior is the binding constraint? Separate the durable question ("Is RL fundamentally constrained by pretraining?") from perishable limitations ("RL collapses diversity in current setups"). Cite what resolved it, and flag plainly where a constraint still holds.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has any recent paper shown RL *does* expand search beyond the prior, or that algorithm design (not prior capacity) is actually the bottleneck? Name it.

(3) **Propose 2 research questions that ASSUME the regime may have moved:**
   – If the prior is truly the ceiling, what is the minimal sufficient pretraining signal to achieve a given RL performance floor?
   – If RL is just steering, can we measure and predict which pretraining formats or latent subspaces RL will converge to *before* RL runs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
