Can the exploration ceiling be raised beyond what pretraining established?
This note explores whether post-training methods such as RL can push a model's exploration and capability past the ceiling set by pretraining, or whether they only rearrange what was already there.
The corpus splits cleanly into two camps, and the disagreement turns out to be the interesting part. One camp says the ceiling is fixed: RL post-training mostly teaches a model *when* to deploy reasoning it already had, not *how* to reason in new ways. Hybrid models recover 91% of the gains by routing tokens alone, and the activation patterns for reasoning strategies exist before any RL touches the model ("Does RL post-training create reasoning or just deploy it?"). Worse, RL can actively *lower* the ceiling: optimizing toward a reward collapses behavioral diversity, with search agents converging on narrow strategies through the same entropy-collapse mechanism seen in reasoning ("Does reinforcement learning squeeze exploration diversity in search agents?").
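One way to make the diversity-collapse claim measurable is to label each rollout with the strategy it used and track the Shannon entropy of that label distribution over training. The sketch below is illustrative only: the strategy labels, the `strategy_entropy` helper, and the example rollouts are hypothetical and do not come from the cited notes.

```python
from collections import Counter
from math import log


def strategy_entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution over strategy labels."""
    if not labels:
        raise ValueError("need at least one rollout label")
    total = len(labels)
    counts = Counter(labels)
    # H = -sum_i p_i * log(p_i), where p_i is the share of rollouts using strategy i
    return -sum((c / total) * log(c / total) for c in counts.values())


# Hypothetical rollouts before and after reward optimization.
before_rl = ["search", "decompose", "verify", "backtrack"] * 25  # four strategies, evenly used
after_rl = ["search"] * 97 + ["decompose"] * 3                   # one strategy dominates

entropy_before = strategy_entropy(before_rl)
entropy_after = strategy_entropy(after_rl)
print(f"before RL: {entropy_before:.3f} nats, after RL: {entropy_after:.3f} nats")
```

A falling entropy curve during training would be one signal that the policy is narrowing onto a few reward-maximizing strategies.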
But the other camp shows the ceiling is not absolute; it is *conditional*. Whether RL creates new capability depends on the task: for standard reasoning it activates latent skills, but for complex multi-step planning it generates genuinely novel strategies the base model can't reach even with extensive sampling ("Does reinforcement learning create new reasoning abilities or activate existing ones?"). And the conditions matter enormously: prolonged RL with KL control, policy resetting, and non-mathematical tasks yields models that beat the base model at *every* pass@k level. That is the signature of real boundary expansion rather than mere sampling efficiency, especially in domains where the base model never established a pattern to begin with ("Can reinforcement learning discover reasoning strategies base models cannot?").
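The pass@k comparison can be made concrete with the standard unbiased estimator: draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k)/C(n, k). The sketch below uses invented per-problem counts (not results from the cited notes) to contrast a model that only sharpens sampling with one that expands the boundary. The helper `mean_pass_at_k`, the sample budget `N`, and all three count lists are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100  # samples drawn per problem (assumed)
# Hypothetical correct-sample counts on three problems.
base = [30, 2, 0]        # weak, and never solves the third problem
sharpened = [90, 0, 0]   # better on the easy problem, lost the second one
expanded = [90, 20, 10]  # better on every problem, including the unsolved one

for k in (1, 10, 100):
    scores = [round(mean_pass_at_k(m, N, k), 3) for m in (base, sharpened, expanded)]
    print(f"pass@{k}: base={scores[0]} sharpened={scores[1]} expanded={scores[2]}")
```

In this toy setup the sharpened model wins at pass@1 but falls behind the base model at larger k, while the expanded model stays ahead at every k. The second pattern is what the text calls boundary expansion.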
The most useful reframe in the corpus is that you may be asking the question at the wrong stage. Instead of fighting the ceiling in post-training, two notes move the work *earlier*, into pretraining itself. Treating chain-of-thought as an exploratory action rewarded by information gain plants reasoning during pretraining and lifts benchmarks ~19% ("Can chain-of-thought reasoning be learned during pretraining itself?"), while augmenting pretraining data with generated reasoning traces buys 3x data efficiency ("Can training data augmentation match test-time compute scaling benefits?"). If the ceiling is set by pretraining, the leverage is in changing what pretraining establishes, not in straining against it afterward.
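The "rewarded by information gain" idea can be sketched as a log-likelihood difference: score a sampled thought by how much it raises the model's log-probability of the observed next token, relative to predicting without the thought. This is a minimal sketch under assumptions. The function name and the example probabilities are hypothetical, and the plain no-thought baseline is a simplification, since the notes here do not spell out the method's details.

```python
from math import log


def information_gain_reward(p_with_thought, p_without_thought):
    """Log-likelihood improvement on the observed next token due to the thought."""
    if not (0.0 < p_with_thought <= 1.0 and 0.0 < p_without_thought <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return log(p_with_thought) - log(p_without_thought)


# Hypothetical probabilities the model assigns to the true next token.
helpful = information_gain_reward(p_with_thought=0.60, p_without_thought=0.20)
useless = information_gain_reward(p_with_thought=0.20, p_without_thought=0.20)
harmful = information_gain_reward(p_with_thought=0.05, p_without_thought=0.20)
print(f"helpful={helpful:.3f} useless={useless:.3f} harmful={harmful:.3f}")
```

Because the reward depends only on the model's own likelihoods, it needs no external verifier: a thought is rewarded when it makes the actual continuation more predictable and penalized when it makes it less so.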
When you do work post hoc, the corpus suggests the ceiling rises only when you escape the failure modes that cause collapse. Breaking plateaus needs *richer signal*: numerical rewards carry no information about why a solution failed, but natural-language critiques unstick models that scalar rewards leave frozen ("Can natural language feedback overcome numerical reward plateaus?"). Structure helps too: abstractions force breadth-first exploration that depth-only chains miss ("Can abstractions guide exploration better than depth alone?"), and sequencing imitation before RL gives the reward something coherent to sharpen ("Does sequencing imitation then exploration training improve reasoning?"). There's even an architectural surprise: in self-supervised RL, scaling network *depth* past critical thresholds produces qualitative behavioral jumps (depth 16 unlocks walking, depth 256 wall-climbing) by improving exploration and expressivity together ("Does network depth unlock qualitatively new behaviors in RL?").
The thing you didn't know you wanted to know: the binding constraint is rarely the model's raw capacity. It's the imagination baked into your training signal. Agents trained only on static expert demonstrations are capped at what their curators imagined, never learning from their own failures ("Can agents learn beyond what their training data shows?"). The exploration ceiling can be raised, but only by methods that feed the model information, structure, or experience it couldn't have generated from the patterns pretraining already gave it.
Sources (11 notes)
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes when that capability is deployed rather than creating it. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
RLP treats CoT as an exploratory action during pretraining, using log-likelihood improvement as a verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.