Does RL training actually restore the critical thinking that reasoning models lose?
This reads the question as: does RL post-training genuinely improve a reasoning model's thinking quality, or does extended 'reasoning' actually degrade judgment in ways RL only partly papers over? The corpus suggests the honest answer is 'RL redirects and deploys thinking that's already there, rather than restoring something lost.'
This explores whether RL training fixes a thinking deficit in reasoning models, and the corpus reframes the question before answering it. The dominant finding isn't that models *lose* critical thinking and RL *restores* it; it's that RL mostly decides *when* to think rather than teaching the model *how*. Several independent lines converge here: base models already carry reasoning strategies in latent form, and minimal training elicits them rather than building them (Do base models already contain hidden reasoning ability?, Does RL teach reasoning or just when to use it?). One hybrid setup, which combines a base model's reasoning with steering from a thinking model, recovers 91% of the performance gain using only 12% of tokens, which is strong evidence that RL is acting as a deployment optimizer, not a capability creator (Does RL post-training create reasoning or just deploy it?).
But the premise hiding in your question — that reasoning *hurts* — turns out to be real, and that's where RL does something closer to 'restoring.' Vanilla models, when told to think longer, often talk themselves into self-doubt that actively degrades their answers. RL reverses the sign: the same extended-thinking mechanism flips from counterproductive second-guessing into productive gap analysis (Does extended thinking help or hurt model reasoning?). So RL isn't recovering lost critical thinking so much as rehabilitating a mechanism that was misfiring — training mediates the *quality* of reasoning, not just the amount.
The catch is that 'better thinking' and 'better scores' can come apart. RL on theory-of-mind tasks shows scale-dependent collapse: 7B models develop explicit, transferable belief-tracking, while smaller ones reach comparable accuracy through shortcut learning with no interpretable reasoning underneath. That gap is invisible unless you read the step-by-step traces (Does reinforcement learning on theory of mind collapse with model scale?). Similarly, RLVR has been shown to sharpen sampling within a model's existing boundaries rather than push past them. For models with appropriate pretraining, a single training example, or even spurious rewards, can trigger the gains, which is hard to square with 'teaching new reasoning' (What does reward learning actually do to model reasoning?). On this view, RL polishes what's already latent and risks rewarding the appearance of thought.
There's a genuine counter-current worth knowing about, though. *Prolonged* RL — with KL control, policy resetting, and training on non-mathematical tasks — outperforms base models across every pass@k level and discovers strategies the base model simply doesn't contain, especially in domains where there's no established pattern to elicit (Can reinforcement learning discover reasoning strategies base models cannot?). And rather than waiting for post-training to repair reasoning, some work plants chain-of-thought during pretraining itself with information-gain rewards, lifting reasoning before any 'loss' can occur (Can chain-of-thought reasoning be learned during pretraining itself?).
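For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled answers is correct, and it is what separates 'sharper sampling' from 'wider boundaries'. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are made up for illustration and do not come from any of the cited notes, and `pass_at_k` and `mean_pass_at_k` are hypothetical helper names.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, each with n samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts (assumption): correct answers out of n=100 samples per problem.
N_SAMPLES = 100
BASE_COUNTS = [2, 1]   # base model: rarely right, but never at zero
RL_COUNTS = [30, 0]    # RL model: much sharper on problem 1, lost problem 2

if __name__ == "__main__":
    for k in (1, 10, 100):
        base = mean_pass_at_k(BASE_COUNTS, N_SAMPLES, k)
        rl = mean_pass_at_k(RL_COUNTS, N_SAMPLES, k)
        print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

In this toy data the RL model is ahead at k=1 but behind at k=100, which is the 'sharpening within existing boundaries' pattern. The prolonged-RL claim is the opposite shape: the RL curve stays above the base curve at every k.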
So the sharper takeaway: RL doesn't restore lost critical thinking — it redeploys, redirects, and occasionally extends thinking. If you want to go deeper on the mechanics, RL training follows a predictable two-phase arc where execution mastery comes first and strategic planning becomes the later bottleneck (Does RL training follow a predictable two-phase learning sequence?), and you can even reward metacognition directly — tagging planning, exploration, and reflection — to teach efficient reasoning rather than just correct answers (Can RL agents learn to reason better, not just succeed?).
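As a toy illustration of what 'rewarding metacognition directly' could look like, the sketch below adds a small bonus, on top of an outcome reward, for each step that contains a non-empty meta-reasoning tag. The XML-style tag syntax, the bonus size, the cap, and the function names are all assumptions made for this example. The notes say only that structured tags are paired with programmatic rewards, so this should not be read as the cited method's actual reward.

```python
import re

# Tag names follow the source notes; the <tag>...</tag> syntax is an assumption.
_TAG_RE = re.compile(
    r"<(planning|exploration|reflection|monitoring)>(.*?)</\1>",
    re.DOTALL,
)


def meta_tags_used(step_text: str) -> set[str]:
    """Return the names of meta-reasoning tags that have non-empty bodies."""
    return {name for name, body in _TAG_RE.findall(step_text) if body.strip()}


def shaped_reward(
    steps: list[str],
    task_succeeded: bool,
    tag_bonus: float = 0.1,   # hypothetical per-step bonus
    max_bonus: float = 0.5,   # hypothetical cap so tags cannot outweigh success
) -> float:
    """Outcome reward plus a capped bonus for steps that show tagged reasoning."""
    outcome = 1.0 if task_succeeded else 0.0
    tagged_steps = sum(1 for step in steps if meta_tags_used(step))
    return outcome + min(tag_bonus * tagged_steps, max_bonus)


if __name__ == "__main__":
    trajectory = [
        "<planning>find the key before trying the door</planning> go north",
        "go north",
        "<reflection>  </reflection> open door",  # empty tag earns no bonus
    ]
    print(shaped_reward(trajectory, task_succeeded=True))
```

Capping the bonus is one simple guard against the failure mode raised earlier, where a model is rewarded for the appearance of thought: emitting more tags cannot make up for failing the task.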
Sources (10 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.