INQUIRING LINE

How does RL refine reasoning paths without simply adding model capability?

This explores a live tension in the corpus: whether reinforcement learning teaches models genuinely new reasoning, or mostly reorganizes and selects reasoning the base model already had.


The question is whether RL actually makes a model smarter or mostly reshapes how a model uses reasoning it already possesses. The corpus is unusually opinionated here, and it splits into two camps. The dominant finding is that much of what looks like "learning to reason" is really learning *when* to reason. One analysis frames RL post-training as a deployment problem rather than a capability problem: base models already carry reasoning strategies in latent form, and RL mostly optimizes the timing of their use, with hybrid models recovering 91% of the gains by routing tokens alone (Does RL post-training create reasoning or just deploy it?). A companion result on RLVR (RL with verifiable rewards) sharpens this: pass@k analysis shows base models actually *beat* RL-trained ones at high sampling budgets, meaning RL narrows the model toward answers already in its distribution rather than expanding what it can solve (Does RLVR actually expand what models can reason about?). The cleanest statement of the thesis is that five independent techniques all elicit the same pre-existing capability, so the bottleneck is elicitation, not acquisition (Do base models already contain hidden reasoning ability?). The opposing camp reports that RL-trained models outperform base models at every pass@k level when trained with KL control, policy resetting, and non-mathematical tasks, which suggests RL can expand capability boundaries where base models lack established patterns (Can reinforcement learning discover reasoning strategies base models cannot?).
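The pass@k argument is easier to follow with the arithmetic in hand. The sketch below uses the standard unbiased pass@k estimator (draw n samples per problem, count c correct, estimate the chance that at least one of k draws is correct). The per-problem counts are invented toy numbers, not results from any of the cited notes; they only illustrate how a model that is reliable on fewer problems can win at k=1 and lose at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws include a correct one.
        return 1.0
    # 1 minus the probability that all k draws come from the n - c incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL toy data: correct samples out of 100 for four problems.
# "base" solves every problem, but rarely; "rl" solves two reliably and two never.
N_SAMPLES = 100
base_counts = [5, 5, 5, 5]
rl_counts = [90, 90, 0, 0]

for k in (1, 8, 64):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-style counts lead at k=1 and the base-style counts lead at k=64, which is the crossover pattern the narrowing claim rests on. Whether real models show that crossover is an empirical question the two camps answer differently.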


Sources (8 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
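A minimal sketch of the reward idea described above, assuming per-token log-probabilities of the reference answer are already available from a forward pass conditioned on the question and a generated reasoning trace. The function names and the toy numbers are hypothetical, and this is not the VeriFree implementation; it shows only how a conditional answer probability can stand in for a verifier's pass/fail signal.

```python
import math


def reference_answer_probability(answer_token_logprobs: list[float]) -> float:
    """P(reference answer | question, trace) from the answer's per-token log-probs."""
    if not answer_token_logprobs:
        raise ValueError("need at least one answer token")
    # The answer's probability is the product of its token probabilities,
    # i.e. the exponential of the summed log-probs.
    return math.exp(sum(answer_token_logprobs))


def score_traces(logprobs_per_trace: dict[str, list[float]]) -> dict[str, float]:
    """Score each reasoning trace by how likely it makes the reference answer."""
    return {
        name: reference_answer_probability(lps)
        for name, lps in logprobs_per_trace.items()
    }


# HYPOTHETICAL log-probs of a two-token reference answer after two different traces.
toy = {
    "trace_a": [math.log(0.9), math.log(0.8)],  # trace makes the answer likely
    "trace_b": [math.log(0.2), math.log(0.5)],  # trace makes the answer unlikely
}
scores = score_traces(toy)
best = max(scores, key=scores.get)
```

No answer is ever parsed or checked here: a trace scores well only if it makes the reference answer probable, which is why no rule-based or model-based verifier is needed.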

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
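The decoding-only penalty described above can be sketched as a small logit adjustment. This is an illustration under stated assumptions, not the published TIP procedure: the parameter names `penalty` and `window`, their default values, and the toy vocabulary are all hypothetical. The idea shown is that tokens which open a new line of thought are made less likely for a while after the current thought begins, and the model weights are never touched.

```python
def penalize_switch_tokens(
    logits: list[float],
    switch_token_ids: set[int],
    steps_since_thought_start: int,
    penalty: float = 3.0,   # assumed strength, subtracted from the logit
    window: int = 600,      # assumed number of steps the penalty stays active
) -> list[float]:
    """Return a copy of logits with thought-switching tokens penalized inside the window."""
    if steps_since_thought_start >= window:
        # The current thought has had its budget; switching is allowed again.
        return list(logits)
    return [
        x - penalty if i in switch_token_ids else x
        for i, x in enumerate(logits)
    ]


def greedy_pick(logits: list[float]) -> int:
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# HYPOTHETICAL three-token vocabulary; "alternatively" marks a thought switch.
vocab = ["so", "alternatively", "therefore"]
raw_logits = [1.0, 2.0, 1.5]
switch_ids = {1}

early = greedy_pick(penalize_switch_tokens(raw_logits, switch_ids, steps_since_thought_start=10))
late = greedy_pick(penalize_switch_tokens(raw_logits, switch_ids, steps_since_thought_start=900))
```

Early in a thought the penalized decoder continues with "therefore" instead of switching; once the window has passed, the original preference for "alternatively" returns.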

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
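A steering vector built from paired examples, as in the last note, can be sketched in a few lines. The sketch assumes the common mean-difference recipe (average the activation gap between a concise and a verbose version of each example) and an additive update to a hidden state. The note does not state how the vector is computed or applied, so both choices, along with the function names and the `strength` parameter, are assumptions for illustration.

```python
def steering_vector(
    concise_acts: list[list[float]],
    verbose_acts: list[list[float]],
) -> list[float]:
    """Mean difference (concise minus verbose) over paired activation vectors."""
    if not concise_acts or len(concise_acts) != len(verbose_acts):
        raise ValueError("need the same non-zero number of concise and verbose examples")
    n = len(concise_acts)
    dim = len(concise_acts[0])
    return [
        sum(c[d] - v[d] for c, v in zip(concise_acts, verbose_acts)) / n
        for d in range(dim)
    ]


def steer(hidden: list[float], vector: list[float], strength: float = 1.0) -> list[float]:
    """Shift a hidden state along the steering vector; no weights are changed."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# HYPOTHETICAL 2-dimensional activations for two paired examples.
concise = [[1.0, 0.0], [3.0, 2.0]]
verbose = [[0.0, 0.0], [1.0, 4.0]]

vec = steering_vector(concise, verbose)     # [1.5, -1.0]
shifted = steer([0.0, 0.0], vec, strength=2.0)
```

Because the vector is computed once from a small paired set and only added at inference time, the approach stays training-free, which matches the note's description.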

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question: does RL refine *reasoning paths* without expanding underlying model capability, or does the regime of what RL can do shift under newer models, training methods, or eval harnesses?

What a curated library found — and when (dated claims, not current truth):
Findings span Oct 2024–May 2026. A dominant thesis emerged:
• RL post-training optimizes *when* to reason, not *how*; base models already carry reasoning strategies latently (2025-10).
• Routing tokens alone recovers 91% of RL gains, implying capability was pre-existing (2025-10).
• pass@k analysis: base models outperform RL-trained ones at high sampling budgets, suggesting RL narrows rather than expands solution distribution (2025-04).
• Five independent elicitation techniques recover the same pre-existing capability; bottleneck is elicitation, not acquisition (2025-04).
• Some tension: ProRL (2025-05) and RLAD (2025-10) claim RL *does* expand boundaries under prolonged training or abstraction discovery.

Anchor papers (verify; mind their dates):
• arXiv:2510.07364 (Oct 2025) – "Base Models Know How to Reason, Thinking Models Learn When"
• arXiv:2504.13837 (Apr 2025) – "Does Reinforcement Learning Really Incentivize Reasoning Capacity..."
• arXiv:2505.24864 (May 2025) – "ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries"
• arXiv:2510.02263 (Oct 2025) – "RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems"

Your task:
(1) RE-TEST THE "ELICITATION NOT ACQUISITION" CLAIM. Does it hold under the latest base models (e.g., o4 or equivalent), longer training horizons (ProRL, RLAD), or multi-agent orchestration (e.g., recursive reasoning agents with memory)? Which newer results contradict the 2025-10 consensus, and on what evidence?
(2) Surface the strongest *disagreement*: ProRL and RLAD both claim boundary expansion. Ground the conflict — do they use different eval benchmarks, model scales, or reward signals that explain the split?
(3) Propose two durable research questions assuming RL *does* reshape reasoning: (a) Can abstraction discovery (RLAD) + prolonged training (ProRL) together cross the elicitation ceiling? (b) Does recursive or multi-agent RL, which wasn't studied in this cohort, unlock new reasoning paths?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
