INQUIRING LINE

Can post-training push an AI past its original limits — or does extra fine-tuning just rearrange abilities already baked in?

Can the exploration ceiling be raised beyond what pretraining established?

This explores whether post-training methods like RL can push a model's exploration and capability past the ceiling set by its base model's pretraining, or whether they only rearrange what was already there.

The corpus splits cleanly into two camps, and the disagreement turns out to be the interesting part. One camp says the ceiling is fixed: RL post-training mostly teaches a model *when* to deploy reasoning it already had, not *how* to reason in new ways. Hybrid models recover 91% of the gains by routing tokens alone, and the activation patterns for reasoning strategies exist before any RL touches the model (Does RL post-training create reasoning or just deploy it?). Worse, RL can actively *lower* the ceiling: optimizing toward a reward collapses behavioral diversity, with search agents converging on narrow strategies through the same entropy-collapse mechanism seen in reasoning (Does reinforcement learning squeeze exploration diversity in search agents?).
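One way to make "entropy collapse" concrete is to measure the Shannon entropy of the strategies a policy uses across rollouts. The sketch below is illustrative only: the strategy labels and rollout counts are made-up data, and `strategy_entropy` is a hypothetical helper, not something taken from the notes.

```python
import math
from collections import Counter

def strategy_entropy(strategies):
    """Shannon entropy (in bits) of the empirical distribution over strategy labels."""
    counts = Counter(strategies)
    total = len(strategies)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical strategy labels for 8 rollouts (invented for illustration).
before_rl = ["decompose", "backtrack", "lookup", "guess",
             "decompose", "backtrack", "lookup", "guess"]
# After reward optimization, the policy mostly repeats one strategy.
after_rl = ["decompose"] * 7 + ["lookup"]

entropy_before = strategy_entropy(before_rl)  # four equally likely strategies
entropy_after = strategy_entropy(after_rl)    # mass concentrated on one strategy

print(f"before RL: {entropy_before:.2f} bits, after RL: {entropy_after:.2f} bits")
```

A falling entropy across training checkpoints would be one signal that the policy is converging on a narrow set of behaviors.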

But the other camp shows the ceiling is not absolute; it is *conditional*. Whether RL creates new capability depends on the task: for standard reasoning it activates latent skills, but for complex multi-step planning it generates genuinely novel strategies the base model can't reach even with extensive sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). The training conditions matter enormously too: models trained with prolonged RL under KL control, policy resetting, and non-mathematical tasks beat the base model at *every* pass@k level. That is the signature of real boundary expansion rather than mere sampling efficiency, especially in domains where the base model never established a pattern to begin with (Can reinforcement learning discover reasoning strategies base models cannot?).
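The pass@k argument can be sketched numerically. The code below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two problems and their correct-sample counts are hypothetical numbers chosen to show the pattern, not results from any of the papers.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n samples containing c correct ones, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct-sample counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N_SAMPLES = 100
# Hypothetical correct-sample counts per problem (invented for illustration).
# Problem 1: the base model sometimes succeeds (latent skill).
# Problem 2: the base model never succeeds in 100 samples.
base_correct = [2, 0]
rl_correct = [30, 5]

for k in (1, 10, 100):
    base = mean_pass_at_k(base_correct, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_correct, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

On the first problem the base model catches up as k grows, which is the sampling-efficiency story. On the second it stays at zero for every k, so the averaged base curve never reaches the RL curve. A gap that persists at large k is what the notes read as boundary expansion.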

The most useful reframe in the corpus is that you may be asking the question at the wrong stage. Instead of fighting the ceiling in post-training, several notes move the work *earlier*, into pretraining itself. Treating chain-of-thought as an exploratory action rewarded by information gain plants reasoning during pretraining and lifts benchmarks ~19% (Can chain-of-thought reasoning be learned during pretraining itself?), while augmenting pretraining data with generated reasoning traces buys 3x data efficiency (Can training data augmentation match test-time compute scaling benefits?). If the ceiling is set by pretraining, the leverage is in changing what pretraining establishes, not in straining against it afterward.
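The information-gain idea can be sketched as a reward equal to how much a sampled thought improves the log-likelihood of the text that actually follows. This is a minimal illustration under stated assumptions: the notes say only that the reward is a log-likelihood improvement, so the exact baseline, the function name `information_gain_reward`, and the probabilities below are all hypothetical.

```python
import math

def information_gain_reward(logp_with_thought, logp_without_thought):
    """Reward = improvement in log-likelihood of the observed next token
    when the model conditions on a sampled chain of thought.
    Positive means the thought helped prediction; negative means it hurt."""
    return logp_with_thought - logp_without_thought

# Hypothetical probabilities the model assigns to the true next token.
p_without = 0.10        # predicting directly from the context
p_helpful = 0.40        # after a thought that makes the token more likely
p_unhelpful = 0.05      # after a thought that makes the token less likely

helpful_reward = information_gain_reward(math.log(p_helpful), math.log(p_without))
unhelpful_reward = information_gain_reward(math.log(p_unhelpful), math.log(p_without))

print(f"helpful thought reward:   {helpful_reward:+.3f} nats")
print(f"unhelpful thought reward: {unhelpful_reward:+.3f} nats")
```

Because the reward comes from the model's own likelihood on ordinary text, no external verifier is needed, which is what lets this kind of signal run during pretraining.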

When you do work post-hoc, the corpus suggests the ceiling rises only when you escape the failure modes that cause collapse. Breaking plateaus needs *richer signal*: numerical rewards lack information about why a solution failed, but natural-language critiques unstick models that scalar rewards leave frozen (Can natural language feedback overcome numerical reward plateaus?). Structure helps too: abstractions force breadth-first exploration that depth-only chains miss (Can abstractions guide exploration better than depth alone?), and sequencing imitation before RL gives the reward something coherent to sharpen (Does sequencing imitation then exploration training improve reasoning?). There's even a non-RL surprise: scaling network *depth* past critical thresholds produces qualitative behavioral jumps (depth 16 unlocks walking, depth 256 wall-climbing) by improving exploration and expressivity together (Does network depth unlock qualitatively new behaviors in RL?).

The thing you didn't know you wanted to know: the binding constraint is rarely the model's raw capacity. It's the imagination baked into your training signal. Agents trained only on static expert demonstrations are capped at what their curators imagined, never learning from their own failures (Can agents learn beyond what their training data shows?). The exploration ceiling can be raised, but only by methods that feed the model information, structure, or experience it couldn't have generated from the patterns pretraining already gave it.

Sources 11 notes

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher evaluating whether post-training can genuinely expand a language model's exploration ceiling, or whether it merely activates latent capacity set during pretraining. The question remains open despite recent work.

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–May 2026. The corpus splits into two camps:

• One camp: RL post-training teaches *when* to deploy reasoning, not *how* to reason; reward optimization collapses behavioral diversity and can *lower* the ceiling (2025–26).
• Other camp: The ceiling is conditional — RL generates genuinely novel strategies for multi-step planning and non-mathematical tasks, especially under prolonged RL with KL control; this is real boundary expansion, not sampling efficiency (2025).
• Pretraining leverage: Treating chain-of-thought as exploratory action during pretraining lifts benchmarks ~19%; thinking-augmented pretraining buys 3x data efficiency (2025–26).
• Post-hoc escape routes: Natural-language critiques break plateaus scalar rewards leave frozen; reasoning abstractions force breadth-first exploration; sequencing imitation before RL improves outcomes (2025–26).
• Non-RL surprise: Scaling network depth past critical thresholds produces qualitative behavioral jumps — depth 256 enables wall-climbing by improving exploration and expressivity (2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.24864 (ProRL, May 2025): Prolonged RL expands reasoning boundaries.
• arXiv:2506.03106 (Critique-GRPO, June 2025): Natural-language feedback breaks plateaus.
• arXiv:2503.14858 (1000 Layer Networks, March 2025): Depth-driven behavioral jumps.
• arXiv:2605.22817 (Vector Policy Optimization, May 2026): Diversity training improves search.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the "ceiling is fixed" claim, investigate whether newer reasoning models, multi-agent orchestration, or improved evaluation harnesses have since contradicted it. For the "conditional ceiling" claim, check whether scaling (model size, compute, data) or architectural changes have unified or further fragmented the task-dependency. Separately state which constraints appear durable (still hold in 2026 models) and which have relaxed.
(2) Surface the strongest contradicting work from the last ~6 months — particularly any paper showing genuine capability creation even in standard reasoning, or showing that depth/scale alone dissolves the imitation–RL pipeline necessity.
(3) Propose 2 research questions that assume the regime has moved: one exploring whether exploration ceilings are *task-geometry-dependent* rather than task-category-dependent; one testing whether orchestration (memory, multi-agent debate, synthetic data generation) can raise ceilings *without* new pretraining.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
