INQUIRING LINE

Can RL teach when to use reasoning versus when to respond directly?

This explores whether reinforcement learning (RL) actually teaches a model the *judgment* of when to think step-by-step versus answer immediately — rather than teaching it to reason better in the first place.


The corpus comes down strongly on the side of yes, with an important twist: that timing-and-routing skill may be most of what RL adds. Several notes converge on the idea that the raw reasoning capability already lives in the base model after pre-training, and that what RL really optimizes is *deployment*. One striking demonstration combines a base model's reasoning with a thinking model's steering and recovers 91% of the performance gains using only 12% of the tokens, which is strong evidence that RL acts as a deployment optimizer rather than a capability creator (see: Does RL teach reasoning or just when to use it?; Does RL post-training create reasoning or just deploy it?). The reasoning 'strategy vectors' exist before any RL touches the model, which reframes RL training as learning a routing policy over latent abilities.

The most direct answer to your question is a method that trains exactly that switch. Thinkless uses a decoupled RL scheme (DeGRPO) to train a single model to pick between extended thinking and a fast direct response, and crucially it does this *without* being told in advance which questions are hard; it self-calibrates difficulty. The decoupling matters: separating 'which mode' from 'how to answer' is what stops the model from collapsing into always-think or always-skip (see: Can models learn when to think versus respond quickly?). So 'when to reason' isn't just an emergent side effect; it can be made the explicit training target.
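The decoupling idea can be illustrated with a toy loss. This is a minimal sketch, not the Thinkless implementation: the `Rollout` fields, the `alpha` weight, and both loss functions are hypothetical names chosen for illustration, and clipping and KL terms are left out. It assumes the mode choice is a single control token followed by the answer tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    control_logprob: float          # log-prob of the mode token (think vs. direct)
    response_logprobs: List[float]  # log-probs of the answer tokens after it
    advantage: float                # group-relative advantage of this rollout


def coupled_loss(r: Rollout) -> float:
    """One shared length normalisation over mode token and answer tokens."""
    total = r.control_logprob + sum(r.response_logprobs)
    return -r.advantage * total / (1 + len(r.response_logprobs))


def decoupled_loss(r: Rollout, alpha: float = 0.5) -> float:
    """Mode token weighted by alpha (assumed value); answer tokens averaged separately."""
    response_mean = sum(r.response_logprobs) / max(len(r.response_logprobs), 1)
    return -r.advantage * (alpha * r.control_logprob + response_mean)


def control_weight(r: Rollout, decoupled: bool, alpha: float = 0.5) -> float:
    """Size of the gradient on the mode token's log-prob, per unit of advantage."""
    return alpha if decoupled else 1.0 / (1 + len(r.response_logprobs))


# Toy rollouts with made-up log-probs: a short direct answer and a long thinking trace.
short_r = Rollout(-0.7, [-0.1] * 10, advantage=1.0)
long_r = Rollout(-0.7, [-0.1] * 1000, advantage=1.0)

for name, r in (("short", short_r), ("long", long_r)):
    print(name, control_weight(r, decoupled=False), control_weight(r, decoupled=True))
```

In the coupled version the weight on the mode token shrinks as the response grows, so long thinking rollouts send a weaker signal about the mode decision than short ones. Normalising the two parts separately keeps that signal the same regardless of length, which is one plausible reading of why decoupling helps against mode collapse.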

There's a complementary clue from prompting research that explains *why* a routing skill helps: step-by-step reasoning isn't always beneficial. For simple questions, letting the question flow directly to the answer outperforms chain-of-thought, because forcing reasoning can actually get in the way (see: Why do some questions perform better without step-by-step reasoning?). If reasoning sometimes *hurts*, then learning when to suppress it is a real capability, not a trivial one, which is the lateral payoff of your question.

The honest tension in the corpus is whether RL *only* does timing. The 'deployment optimizer' camp is reinforced by work showing RL improves sampling efficiency within existing boundaries: a single training example or even spurious rewards can activate pretrained strategies, and base models can match or outperform RL-trained models at high sampling budgets (see: What does reward learning actually do to model reasoning?; Does RLVR actually expand what models can reason about?). But the opposing camp shows that *prolonged* RL on diverse, non-mathematical tasks discovers genuinely novel strategies the base model can't reach, beating it at every sampling level (see: Can reinforcement learning discover reasoning strategies base models cannot?). A two-phase view partly reconciles this: RL first consolidates execution, then shifts the bottleneck to strategic planning, and the second phase is exactly where 'when and how to deploy reasoning' gets learned (see: Does RL training follow a predictable two-phase learning sequence?).
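Both camps argue from pass@k curves, so it helps to see how such a curve is computed. The sketch below uses the standard unbiased pass@k estimator (for n samples with c correct, 1 - C(n-c, k) / C(n, k)). The per-problem counts are made-up toy numbers, not results from any of the cited work, and `mean_pass_at_k` is a hypothetical helper name.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples of which c were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 64
# Toy data (assumed): the base model solves every problem, but rarely;
# the RL-tuned model solves half the problems reliably and the rest never.
base_counts = [2, 1, 3, 1]
rl_counts = [60, 58, 0, 0]

for k in (1, 8, 64):
    print(k, mean_pass_at_k(base_counts, N, k), mean_pass_at_k(rl_counts, N, k))
```

With these toy counts the RL-tuned model wins at k = 1 and the base model wins at k = 64, which is the crossover pattern the 'deployment optimizer' camp points to. The opposing camp's claim corresponds to an RL curve that stays above the base curve at every k.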

If you want to follow the thread further, the corpus also has approaches that fold the same when-to-reason judgment into different stages of training: meta-reasoning rewards that teach agents to plan, explore, and reflect efficiently and cut repetitive actions by 31% (see: Can RL agents learn to reason better, not just succeed?); curricula that imitate first and then sharpen with RL (see: Does sequencing imitation then exploration training improve reasoning?); and even pushing reasoning all the way back into pre-training itself (see: Can chain-of-thought reasoning be learned during pretraining itself?). The takeaway you might not have expected: 'when to reason' is emerging as a first-class skill that can be trained, routed, and measured on its own, not just a byproduct of teaching a model to think.


Sources (11 notes)

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
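The 'log-likelihood improvement' reward in the last note can be written down in a few lines. This is a hedged sketch of the general idea only: the function name and the two probability inputs are hypothetical, and the note does not say how RLP obtains the no-thought baseline, so that part is an assumption.

```python
import math


def log_likelihood_gain(p_with_thought: float, p_without_thought: float) -> float:
    """Reward a thought by how much it raises the log-prob of the true next token.

    Both arguments are probabilities the model assigns to the observed next
    token, once conditioned on the sampled thought and once without it.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


# Toy probabilities (assumed): one thought that helps prediction, one that hurts.
helpful = log_likelihood_gain(0.6, 0.3)
harmful = log_likelihood_gain(0.2, 0.4)
print(helpful, harmful)
```

The reward is positive when the thought makes the observed text more predictable and negative when it makes it less so, which is why no external verifier is needed.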

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether RL teaches routing-to-reasoning as a first-class skill, or merely optimizes deployment of pre-existing reasoning. The question remains: *Can RL teach when to use reasoning versus when to respond directly?*

What a curated library found — and when (findings span 2023–12/2025, dated claims, not current truth):
• Base models contain pre-trained reasoning 'strategy vectors'; RL acts as a routing optimizer, recovering 91% of thinking-model gains using 12% of tokens (2025).
• Thinkless (DeGRPO) trains explicit timing: a single model learns to pick extended thinking vs. fast response *without* being told which questions are hard, self-calibrating difficulty (2025).
• Step-by-step reasoning *hurts* on simple questions; learning when to suppress reasoning is a real capability, not trivial (2024).
• Tension: RL improves sampling efficiency within base-model boundaries (spurious rewards activate pre-trained strategies); *prolonged* RL discovers genuinely novel strategies unreachable by the base model (2025).
• Two-phase RL dynamics: procedural consolidation precedes strategic planning, where 'when and how to deploy reasoning' emerges (2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.13379 (Thinkless, 2025-05): Decoupled RL for routing extended thinking.
• arXiv:2505.24864 (ProRL, 2025-05): Prolonged RL discovers novel strategies.
• arXiv:2507.22844 (RLVMR, 2025-07): Meta-reasoning rewards; cuts repetitive actions 31%.
• arXiv:2512.07783 (Interplay, 2025-12): Pre-training, mid-training, RL on reasoning models.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 91%-with-12%-tokens claim and the "routing as deployment" framing: has post-12/2025 work shown that RL actually *expands* reasoning capability (not merely routes it), or does the pre-trained-vectors thesis still hold? Does Thinkless-style decoupling scale beyond toy setups, and does self-calibration of difficulty remain stable? Cite what revised or upheld these claims.
(2) Surface the strongest *contradicting* work from the last 6 months — any paper arguing RL fundamentally *creates* reasoning capacity, not just times its use, and whether it outweighs the routing-optimizer view.
(3) Propose 2 research questions that assume the regime *has* moved: (a) If routing is now solved, what is RL's next bottleneck — strategic novelty, generalization to unseen task classes, or calibration under distribution shift? (b) Can a unified metric measure 'when-to-reason' skill across models, or does each routing strategy remain task-specific?
