INQUIRING LINE

Why does extended reasoning training improve exploration without adding new capabilities?

This explores why reinforcement-style reasoning training makes models better at searching the solution space (exploration) even though, on the evidence, it isn't installing skills the base model didn't already have.


The corpus's blunt answer is that the capability was already there; training only changes how well the model reaches for it. Several independent lines of evidence converge on this. RLVR (reinforcement learning with verifiable rewards) has been shown to sharpen sampling efficiency inside a model's existing capability boundary rather than pushing past it. Strikingly, a single training example can trigger the gain, and for models with appropriate pretraining even spurious (incorrect) rewards work nearly as well as correct ones, which only makes sense if the reward is activating pretrained strategies instead of teaching new ones (see "What does reward learning actually do to model reasoning?"). Five separate mechanisms (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, and RLVR) all surface reasoning that was already latent in base-model activations, suggesting that the real bottleneck is elicitation, not acquisition (see "Do base models already contain hidden reasoning ability?").
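The difference between sharpening sampling efficiency and expanding the capability boundary can be made concrete with pass@k. The sketch below uses the standard unbiased pass@k estimator; the per-problem counts are invented for illustration and do not come from any cited paper. Both hypothetical models can solve the same three problems and neither can solve the fourth, so the tuned model is far ahead at k=1 while both hit the same ceiling at large k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    is correct."""
    if n - c < k:
        return 1.0  # too few wrong attempts to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# HYPOTHETICAL data: correct answers out of N samples, per problem.
# Same solvable set for both models; the last problem is out of reach.
N = 100
base_correct = [5, 10, 2, 0]
tuned_correct = [60, 80, 40, 0]

def mean_pass_at_k(correct_counts, k):
    """Average pass@k over all problems."""
    return sum(pass_at_k(N, c, k) for c in correct_counts) / len(correct_counts)

for k in (1, 10, 100):
    base = mean_pass_at_k(base_correct, k)
    tuned = mean_pass_at_k(tuned_correct, k)
    print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}")
```

With these made-up counts the gap is large at k=1 and disappears at k=100, where both models are capped by the one problem neither ever solves. That is the shape the "elicitation, not acquisition" reading predicts; a curve where the tuned model exceeds the base model's large-k ceiling would point the other way.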

So if the skills pre-exist, what does training actually move? The cleanest framing is that it teaches *when* to reason, not *how*: RL post-training optimizes deployment timing, hybrid models recover ~91% of the gains just by routing tokens, and reasoning-strategy vectors are detectable before any RL at all (see "Does RL post-training create reasoning or just deploy it?"). That deployment story is exactly where exploration comes from. Untrained models can use extended thinking *against* themselves, spiraling into self-doubt that degrades answers; RL flips the same mechanism into productive gap analysis, so the improvement is a change in the quality of search, not its raw quantity (see "Does extended thinking help or hurt model reasoning?").

The exploration angle gets richer when you look at what the model is trained *on*. Training on full, messy search traces, including mistakes and backtracking, yields 25% higher accuracy than training only on clean optimal trajectories, because the model internalizes an adaptive search procedure rather than a fixed route (see "Does training on messy search processes improve reasoning?"). And exploration can be made structurally better: allocating test-time compute to diverse abstractions enforces breadth-first search and prevents the 'underthinking' trap of going deep down one chain (see "Can abstractions guide exploration better than depth alone?"). In both cases the model isn't gaining new knowledge; it's learning to organize and time its exploration of knowledge it already holds.
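To make the contrast between a messy search trace and a clean optimal trajectory concrete, here is a minimal sketch on a toy arithmetic puzzle. The puzzle, the `try ... | dead end ... backtrack` string format, and the function names are all assumptions chosen for illustration; they are not the serialization used in the cited work. The same depth-first search emits two strings: the full trace with every dead end, and the solution path alone.

```python
from itertools import combinations

def search(numbers, target, trace, path):
    """Depth-first search that logs every attempted move and every dead end."""
    if len(numbers) == 1:
        if numbers[0] == target:
            trace.append(f"goal {target} reached")
            return True
        trace.append(f"dead end at {numbers[0]}, backtrack")
        return False
    # Pick any two numbers, combine them, and recurse on the smaller set.
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [n for idx, n in enumerate(numbers) if idx not in (i, j)]
        moves = (
            (f"{a}+{b}", a + b),
            (f"{a}*{b}", a * b),
            (f"{max(a, b)}-{min(a, b)}", abs(a - b)),
        )
        for expr, value in moves:
            step = f"{expr}={value}"
            trace.append(f"try {step}")
            path.append(step)
            if search(rest + [value], target, trace, path):
                return True
            path.pop()  # undo the move; only the trace remembers it
    return False

def serialize(numbers, target):
    """Return (solved, full search trace, solution-only path) as strings."""
    trace, path = [], []
    solved = search(list(numbers), target, trace, path)
    return solved, " | ".join(trace), " | ".join(path)

solved, full_trace, clean_path = serialize([2, 3, 4], 14)
print("clean path:", clean_path)
print("full trace length:", len(full_trace), "characters")
```

The clean path keeps only the moves that worked, while the full trace also records each failed attempt and the point where the search gave up on it. A model trained on the second kind of string sees examples of recovering from a wrong turn; a model trained on the first never does.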

This also explains the failure modes, which are the tell that nothing new is being added. More thinking is not monotonically better: pushing thinking tokens from ~1,100 to ~16K dropped accuracy from 87.3% to 70.3%, because models overthink easy problems and underthink hard ones (see "Does more thinking time always improve reasoning accuracy?"). And training that rewards producing reasoning steps never teaches a model when to *stop*: reasoning models will generate long answers to ill-posed questions with missing premises that non-reasoning models correctly reject (see "Why do reasoning models overthink ill-posed questions?"). If training were installing genuine new competence rather than tuning exploration policy, you wouldn't expect these clean reversals.

The one nuance worth carrying away: 'no new capabilities' doesn't mean 'order doesn't matter.' A curriculum that does imitation first (to create reasonable rollouts) and exploration-based RL second outperforms either alone, because the imitation phase makes the outcome rewards informative enough for the RL phase to sharpen those rollouts (see "Does sequencing imitation then exploration training improve reasoning?"). So the deeper takeaway is that 'better exploration' is really 'better selection and timing of latent abilities,' and the leverage is in shaping the search process, not in stuffing new skills into the weights.


Sources (9 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: does extended reasoning training improve exploration by eliciting latent capabilities, or by installing genuinely new skills? A curated library of arXiv papers (2024–present) claims findings across this terrain:

What a curated library found — and when (dated claims, not current truth):
• Reasoning improvement comes from *when* to reason, not *how* — RL post-training optimizes deployment timing, with reasoning-strategy vectors detectable in base-model activations before any RL (2025-10, arXiv:2510.07364).
• Training on full, messy search traces yields 25% higher accuracy than training on clean optimal trajectories, because models internalize an adaptive search procedure rather than memorizing fixed routes (2025-05, arXiv:2505.20296).
• More thinking degrades accuracy beyond ~1,100 tokens: pushing to 16K dropped accuracy from 87.3% to 70.3%, showing models overthink easy problems and underthink hard ones (2025-04, arXiv:2505.00127).
• Even spurious (incorrect) rewards work nearly as well as correct ones in RLVR, suggesting reward activates pretrained strategies rather than teaching new ones (2024-02, arXiv:2402.05808).
• Reasoning models lack a stopping criterion: they generate long answers to ill-posed questions with missing premises that non-reasoning models correctly reject (2025-07, arXiv:2507.23407).

Anchor papers (verify; mind their dates): arXiv:2510.07364 (Oct 2025), arXiv:2505.20296 (May 2025), arXiv:2505.00127 (Apr 2025), arXiv:2402.05808 (Feb 2024).

Your task:
(1) RE-TEST THE CORE TENSION: Does the 'latent capability' claim hold under newer post-training regimes (e.g., RLP as pretraining objective, arXiv:2510.01265)? Has any recent work shown RL *does* install new reasoning skills rather than routing existing ones? Separate the durable claim (exploration is orchestration) from the perishable one (no new skills acquired); flag what contradicts it.
(2) Surface the strongest work from the last 6 months arguing the *opposite* — that reasoning training does add capability, not merely call it forth.
(3) Propose 2 research questions assuming the regime has shifted: (a) Can curriculum design (imitation→RL) escape the latent-activation frame? (b) Does multi-agent orchestration reveal capabilities inaccessible to single-model extended reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
