INQUIRING LINE

Why does the pretrained prior determine the exploration ceiling?

This explores why the abilities baked in during pretraining seem to set a hard limit on how far later reinforcement learning can push a model to discover new behaviors — and whether RL adds capability or just selects from what's already there.


The corpus points to a striking answer: post-training mostly *selects* from a menu the pretrained model already wrote, rather than writing new menu items. Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning that was already latent in base-model activations, which suggests the real bottleneck is elicitation, not capability acquisition (see: Do base models already contain hidden reasoning ability?). If a behavior was never in the prior, no amount of reward-chasing conjures it.

The ceiling becomes visible in how RL narrows things. Within the first epoch, reinforcement learning tends to amplify a single dominant format that already existed in the pretraining distribution while suppressing the alternatives, and which format wins depends on model scale, not necessarily on performance (see: Does RL training collapse format diversity in pretrained models?). The same compression shows up in search agents, where RL collapses behavioral diversity through the familiar entropy-collapse mechanism and policies converge on narrow reward-maximizing strategies (see: Does reinforcement learning squeeze exploration diversity in search agents?). RL is a funnel: it sharpens the prior's strongest mode and discards the rest. That's powerful when the right behavior is already a strong mode, and useless when it isn't.
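The funnel can be made concrete with a toy model. The sketch below is an illustration only, not the setup of any cited study: it assumes a softmax policy over four hypothetical "formats", made-up prior probabilities and rewards, and applies the exact expected policy-gradient update, where each logit moves by p_k * (r_k - expected reward). Because the update is scaled by p_k, the dominant prior mode absorbs nearly all of the probability mass, entropy falls, and the rare format stays negligible even though it has the highest reward. A softmax never assigns exactly zero probability, so the toy shows "extremely slow" rather than "impossible".

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def policy_gradient_funnel(prior, rewards, steps=200, lr=1.0):
    """Run exact expected policy-gradient ascent starting from a prior.

    For a softmax policy, d E[r] / d logit_k = p_k * (r_k - E[r]),
    so low-probability modes receive proportionally tiny updates.
    """
    logits = [math.log(p) for p in prior]
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [x + lr * p * (r - expected)
                  for x, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

# Hypothetical numbers: format 3 is the best behavior but is almost absent from the prior.
prior = [0.70, 0.20, 0.0999, 0.0001]
rewards = [0.7, 0.6, 0.5, 1.0]

after = policy_gradient_funnel(prior, rewards)
print("entropy before:", entropy(prior))
print("entropy after: ", entropy(after))
print("probabilities after:", after)
```

Swapping the prior so that the high-reward format starts as a strong mode makes the same update find it quickly, which is the "powerful when the right behavior is already a strong mode" half of the claim.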

The ceiling has a second source: the data a model learns from. Agents trained on static expert demonstrations are capped by what the curators imagined, because they never interact with an environment and so never learn from their own failures (see: Can agents learn beyond what their training data shows?). So the 'prior' isn't only the base model's weights; it's the horizon of scenarios baked into training. Push beyond that horizon with overly hard problems and the model doesn't reach higher. It learns degenerate shortcuts that even contaminate skills it previously had, because rare accidental successes get treated as high-value trajectories (see: Do overly hard RLVR samples actually harm model capabilities?).
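The note on overly hard RLVR samples (listed under Sources) attributes this effect to group-relative normalization. A minimal sketch of why that happens, assuming binary rewards and the common "subtract the group mean, divide by the group standard deviation" form: the function names, the group size of 16, the use of the population standard deviation, and the `eps` term are all assumptions for illustration. With success rate p in a group, a successful rollout's advantage is roughly sqrt((1 - p) / p), so the rarer the success, the larger the reinforcement it receives.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

def success_advantage(num_success, group_size):
    """Advantage assigned to a successful rollout when num_success of group_size succeed."""
    rewards = [1.0] * num_success + [0.0] * (group_size - num_success)
    return group_relative_advantages(rewards)[0]

# Compare a moderately hard prompt (8 of 16 succeed) with near-impossible ones.
for successes in (8, 2, 1):
    print(successes, "of 16 succeed -> advantage", round(success_advantage(successes, 16), 2))
```

On a near-impossible problem the single success is as likely to be a lucky shortcut as sound reasoning, yet it receives the largest push of any rollout in the group.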

Here's the twist worth knowing: the ceiling is partly about *timing and signal*, not just raw capability. Models commit to choices prematurely because uncertainty signals dominate the early transformer blocks while the long-horizon 'empowerment' signals that favor exploration emerge only in the middle blocks, a temporal mismatch that throttles exploration before it starts (see: Why do large language models explore less effectively than humans?). And the apparent exploration-vs-exploitation trade-off may itself be a measurement artifact that appears only at the token level and vanishes under hidden-state analysis (see: Is the exploration-exploitation trade-off actually fundamental?). So part of the 'ceiling' is the prior failing to surface what it already contains, not a true absence of ability.
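The hidden-state analysis mentioned above relies on Effective Rank metrics, according to the corresponding note under Sources. As a rough illustration, here is one widely used definition of effective rank: the exponential of the Shannon entropy of the normalized singular values of a matrix. Whether the cited work uses exactly this form is an assumption, and the sketch takes the singular values as given instead of computing them from a hidden-state matrix.

```python
import math

def effective_rank(singular_values):
    """Effective rank as exp(entropy) of the normalized singular-value distribution.

    Returns a value between 1 (all energy in one direction) and
    len(singular_values) (energy spread evenly across all directions).
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    spectrum_entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(spectrum_entropy)

# Hypothetical spectra: evenly spread versus concentrated in one direction.
spread = effective_rank([1.0, 1.0, 1.0, 1.0])
concentrated = effective_rank([1.0, 0.0, 0.0, 0.0])
skewed = effective_rank([10.0, 1.0, 0.5, 0.1])
print(spread, concentrated, skewed)
```

Unlike token-level entropy, a measure of this kind describes how many directions the representation actually uses, which is why the two can tell different stories about exploration.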

Which is exactly why the most promising work moves the action *into* pretraining or stages it carefully. RLP treats chain-of-thought as an exploratory action during pretraining itself, planting reasoning earlier and lifting benchmarks ~19% (see: Can chain-of-thought reasoning be learned during pretraining itself?). Curricula that run supervised RL first to build a richer prior, then RLVR to sharpen it, beat either method alone, because the imitation phase creates the reasonable rollouts RL needs to be informative (see: Does sequencing imitation then exploration training improve reasoning?). The through-line: if you want a higher exploration ceiling, you raise the prior. RL alone can only spend what pretraining already deposited.
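The RLP note under Sources describes its reward as a log-likelihood improvement that needs no verifier. A minimal sketch of that idea, under stated assumptions: the reward for a sampled thought is how much it raises the log-probability of the tokens that actually come next, compared with predicting them without the thought. The function name, the choice of baseline, and the probabilities below are hypothetical, and real per-token log-probabilities would come from a language model.

```python
import math

def log_likelihood_gain(logp_with_thought, logp_without_thought):
    """Sum over target tokens of log p(token | context, thought) - log p(token | context)."""
    if len(logp_with_thought) != len(logp_without_thought):
        raise ValueError("both inputs must cover the same target tokens")
    return sum(a - b for a, b in zip(logp_with_thought, logp_without_thought))

# Hypothetical single-token cases: a thought that doubles the probability of the
# observed next token, and one that halves it.
helpful = log_likelihood_gain([math.log(0.5)], [math.log(0.25)])
unhelpful = log_likelihood_gain([math.log(0.1)], [math.log(0.2)])
print(helpful, unhelpful)
```

A positive value rewards thoughts that make ordinary text more predictable, so the signal is available on any pretraining corpus and does not depend on problems with checkable answers.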


Sources (9 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why do large language models explore less effectively than humans?

SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst revisiting a question about exploration ceilings in LLM post-training. The question: *Why does the pretrained prior determine the exploration ceiling?* — is it a hard constraint or a measurement artifact that newer training, tooling, or evaluation has since dissolved?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable.
• Post-training mostly *selects* from latent behaviors already in base-model activations; elicitation, not acquisition, is the bottleneck (~2025).
• RL amplifies a single dominant pretraining format within the first epoch while suppressing alternatives; which format wins depends on model scale, not necessarily on performance (~2025).
• Models commit to choices prematurely because uncertainty signals dominate early layers while long-horizon 'empowerment' signals only emerge in middle layers (~2025).
• Agents trained on static expert demonstrations are capped by curator imagination; separately, RLVR on overly hard problems teaches degenerate shortcuts that contaminate prior skills (~2024–2025).
• Chain-of-thought as pretraining action lifts benchmarks ~19%; curriculum (imitation then RLVR) beats either alone (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2501.18009 (2025-01): "Large Language Models Think Too Fast To Explore Effectively"
• arXiv:2504.07912 (2025-04): "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining"
• arXiv:2510.01265 (2025-09): "RLP: Reinforcement as a Pretraining Objective"
• arXiv:2605.28388 (2026-05): "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs"

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above — latent-behavior-only, format collapse, temporal mismatch, out-of-distribution failure, curriculum gains — ask: Have newer model scales, training schedules, multi-agent orchestration, or mechanistic probes shown that these constraints are *relaxable* rather than fundamental? Where does the constraint still hold empirically? Separate the durable question ("what raises the prior?") from the perishable claim ("elicitation alone is sufficient").
(2) SURFACE DISAGREEMENT. Search for work in the last 6 months that contradicts the "prior determines ceiling" framing — e.g., emergent generalization from RL, in-context adaptation without pretraining, or post-hoc capability transfer across domains. State plainly where the library's consensus breaks down.
(3) ASSUME THE REGIME MOVED. Propose two research questions that take as premise: "The ceiling is now higher or conditional on factors the 2024–2026 library did not isolate." Examples: Does on-policy RL during inference (test-time) escape pretraining-era constraints? Can critique-driven self-correction recursively repair out-of-distribution reasoning even after RL convergence?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
