How does RL refine reasoning paths without simply adding model capability?
This explores a live tension in the corpus: whether reinforcement learning teaches models genuinely new reasoning, or mostly reorganizes and selects reasoning the base model already had.
The corpus is unusually opinionated here, and it splits into two camps worth knowing about. The dominant finding is that much of what looks like "learning to reason" is really learning *when* to reason. One analysis frames RL post-training as a deployment problem rather than a capability problem: base models already carry reasoning strategies in latent form, and RL mostly optimizes the timing of their use, with hybrid models recovering 91% of the gains by routing tokens alone (see "Does RL post-training create reasoning or just deploy it?"). A companion result on RLVR sharpens this: pass@k analysis shows base models actually *beat* RL-trained ones at high sampling budgets, meaning RL narrows the model toward answers already in its distribution rather than expanding what it can solve (see "Does RLVR actually expand what models can reason about?"). The cleanest statement of the thesis is that five independent techniques all elicit the same pre-existing capability, so the bottleneck is elicitation, not acquisition (see "Do base models already contain hidden reasoning ability?").
Sources (8 notes)
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
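To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is how many were correct. The per-problem counts below are made up for illustration and do not come from the cited note; they are chosen only to show how a model that samples narrowly can lead at k = 1 yet trail at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 100 on four problems.
base_counts = [2, 2, 2, 2]    # rarely right, but right on every problem
rl_counts = [90, 90, 0, 0]    # reliably right on two problems, never on the others

for k in (1, 50):
    print(k, mean_pass_at_k(base_counts, 100, k), mean_pass_at_k(rl_counts, 100, k))
```

With these illustrative counts the RL-like profile is ahead at k = 1 (0.45 versus 0.02) and behind at k = 50 (0.5 versus roughly 0.75), which is the crossover pattern the note describes.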
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
RL-trained models outperform base models across all pass@k levels when training uses KL control and policy resetting and includes non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
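The reward described above can be sketched in a few lines. This assumes the per-token log-probabilities of the reference answer, conditioned on the question and a sampled reasoning trace, have already been obtained from the model; that scoring call is not shown. The function name and the numbers are hypothetical, and the sketch covers only the reward computation, not the full training objective.

```python
import math


def reference_answer_reward(answer_token_logprobs):
    """Return p(reference answer | question, reasoning trace), computed as
    the exponential of the summed per-token log-probabilities."""
    return math.exp(sum(answer_token_logprobs))


# Hypothetical log-probs of a two-token reference answer under two sampled traces.
trace_a = [math.log(0.9), math.log(0.8)]   # trace that makes the answer likely
trace_b = [math.log(0.2), math.log(0.1)]   # trace that makes the answer unlikely

# No verifier is consulted: the probability itself serves as the reward
# (and, per the note, as the training weight for that trace).
rewards = [reference_answer_reward(t) for t in (trace_a, trace_b)]
print(rewards)
```

A trace that leads the model to assign high probability to the reference answer earns a higher reward, so no rule-based or model-based check of the generated answer is needed.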
Reasoning LLMs exhibit two mutually reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions such as thought-switching penalties improve accuracy without fine-tuning, which suggests the models do reach viable solution paths but abandon them too early.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
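A decoding-only penalty of this kind might look like the following sketch. The note says only that thought-transition tokens are penalized at decoding time; the specific shape used here (subtract a fixed amount `alpha` from those tokens' logits while the current thought is shorter than `beta` tokens) and the default values are assumptions for illustration, as is the function name.

```python
def penalize_thought_switch(logits, switch_token_ids,
                            tokens_in_current_thought,
                            alpha=3.0, beta=600):
    """Return a copy of `logits` with thought-transition tokens discouraged.

    logits: list of floats, one per vocabulary id.
    switch_token_ids: ids of tokens that open a new line of thought
        (for example the id for "Alternatively"); hypothetical here.
    tokens_in_current_thought: tokens generated since the last switch.
    alpha: penalty strength (assumed value).
    beta: number of tokens for which the penalty stays active (assumed value).
    """
    adjusted = list(logits)  # leave the caller's logits untouched
    if tokens_in_current_thought < beta:
        for token_id in switch_token_ids:
            adjusted[token_id] -= alpha
    return adjusted


# Early in a thought the switch token (id 1) is penalized; later it is not.
early = penalize_thought_switch([1.0, 2.0, 3.0], [1], tokens_in_current_thought=10)
late = penalize_thought_switch([1.0, 2.0, 3.0], [1], tokens_in_current_thought=700)
print(early, late)
```

Because the adjustment is applied to logits at sampling time, the model weights are never changed, which matches the "without model fine-tuning" claim.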
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
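One common way to obtain a single steering vector from paired examples is to average the difference between activations on the two sides of each pair. The note does not spell out the extraction procedure, so treat the mean-difference construction below, the function names, and the toy two-dimensional activations as assumptions for illustration rather than the method itself.

```python
def steering_vector(verbose_acts, concise_acts):
    """Mean of (concise - verbose) activation differences over paired examples.

    Each argument is a list of equal-length activation vectors; element i of
    one list is paired with element i of the other.
    """
    if len(verbose_acts) != len(concise_acts) or not verbose_acts:
        raise ValueError("need the same non-zero number of examples on both sides")
    n = len(verbose_acts)
    dim = len(verbose_acts[0])
    return [
        sum(c[i] - v[i] for v, c in zip(verbose_acts, concise_acts)) / n
        for i in range(dim)
    ]


def steer(hidden, vector, scale=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return [h + scale * d for h, d in zip(hidden, vector)]


# Toy activations for two pairs (the note reports using 50 pairs).
verbose = [[1.0, 0.0], [3.0, 2.0]]
concise = [[0.0, 1.0], [1.0, 2.0]]
vec = steering_vector(verbose, concise)
print(vec, steer([0.0, 0.0], vec, scale=2.0))
```

Adding the vector at inference time requires no gradient updates, which is what makes this style of intervention training-free.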