INQUIRING LINE

AI models may already know how to reason — reinforcement learning just teaches them when to switch it on.

How does RPT compare to learning when versus how to deploy reasoning?

This explores a recent claim about RL post-training (RPT): that it teaches models *when* to deploy reasoning they already have, rather than teaching them *how* to reason in the first place — and what the corpus says for and against that split.

Read literally, the question lands on one of the most interesting reframings in the collection: that RPT isn't creating reasoning ability at all; it's learning *when* to switch it on. The clearest statement of this is the finding that base models already carry reasoning strategies in latent form, and RL mostly optimizes deployment timing. Hybrid models recover 91% of the gains by routing tokens alone, and the activation vectors for reasoning strategies exist *before* any RL touches the model (see "Does RL post-training create reasoning or just deploy it?"). So the comparison the question asks for, RPT vs. "learning when vs. how," is really the same debate viewed twice: the deployment view says RPT is a *when* mechanism, not a *how* mechanism.
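The 91% figure is easiest to read as a ratio of improvements. A minimal sketch of that arithmetic follows, assuming "gains recovered" means the hybrid model's improvement over the base model divided by the RL model's improvement over the base model. The note does not spell out the exact definition, the function name is hypothetical, and the scores are illustrative placeholders rather than reported results.

```python
def fraction_of_gain_recovered(base: float, rl: float, hybrid: float) -> float:
    """Share of the RL-over-base improvement that the hybrid model retains.

    Assumed definition (not confirmed by the source):
        (hybrid - base) / (rl - base)
    """
    gain = rl - base
    if gain <= 0:
        raise ValueError("the RL model must beat the base model for the ratio to mean anything")
    return (hybrid - base) / gain


# Illustrative placeholder accuracies, NOT taken from the cited work.
ratio = fraction_of_gain_recovered(base=0.40, rl=0.60, hybrid=0.582)
print(f"recovered {ratio:.0%} of the RL gain")
```

Under this reading, a ratio near 1 says that routing alone reproduces most of what RL added, which is the shape of evidence the deployment view relies on.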

Several notes converge on this from different angles. One shows that reinforcement learning with verifiable rewards (RLVR) improves sampling efficiency *within* a model's existing capability boundary without expanding it: a single training example can suffice to activate the behavior, and even spurious or random rewards work nearly as well as correct ones if pretraining already laid the groundwork (see "What does reward learning actually do to model reasoning?"). Another finds the activation can be genuine even when the benchmark gains are partly memorization: "behavioral activation" and "benchmark improvement" turn out to be separable phenomena measured at different levels (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). A third qualifies the spurious-reward result: on clean, uncontaminated benchmarks only correct rewards help, exposing how much apparent "reasoning" was dataset leakage (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). All of this supports the "when, not how" picture: training is surfacing and timing pre-existing capability, not minting new reasoning.
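One common way to make "sampling efficiency within a capability boundary" concrete is the pass@k metric: RL can raise pass@1 (the answer shows up on the first try) without raising pass@k at large k (the answer was already reachable with enough samples). The notes do not name a metric, so treating pass@k as the yardstick is an assumption here, and the sample counts below are made up for illustration. The estimator is the standard unbiased one, 1 - C(n-c, k) / C(n, k).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given that c of n drawn samples were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem counts: the "base" model is right on 10 of 100
# samples, the "RL-tuned" model on 50 of 100. Illustrative only.
for k in (1, 8, 64):
    base = pass_at_k(n=100, c=10, k=k)
    tuned = pass_at_k(n=100, c=50, k=k)
    print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

In this toy setup the gap between the two models is large at k=1 and shrinks as k grows, which is the pattern a "sharper sampling, same boundary" account would predict.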

The sharpest lateral support comes from work showing reasoning gains are about *format*, not knowledge. A 1.5B model with LoRA-only post-training matched larger full-parameter RL models, implying RL teaches output *organization* rather than new facts, so reasoning and knowledge storage are separable (see "Can small models reason well by just learning output format?"). The chain-of-thought (CoT) critiques push the same point further: logically *invalid* CoT exemplars perform nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), CoT degrades predictably outside its training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?"), and what looks like inference is better described as constrained imitation of reasoning *form* (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Why does chain-of-thought reasoning fail in predictable ways?"). If models are reproducing the shape of reasoning rather than the substance, then "how to reason" was never what training was teaching, which is exactly the deployment thesis.

But the corpus doesn't let the "when, not how" story win cleanly. The SRL-then-RLVR curriculum shows that an imitation phase *first*, one that establishes reasoning foundations, makes the later reward phase informative, and the combination beats either alone (see "Does sequencing imitation then exploration training improve reasoning?"). That's a "how" contribution that a pure deployment view underrates: you sometimes have to build reasonable rollouts before timing can be sharpened. The test-time evidence cuts both ways. Once reasoning is deployed, the choice of search framework matters less than people think: Best-of-N (BoN) and Monte Carlo tree search (MCTS) converge once you control for total compute, and what remains depends on search scope and reward-function reliability (see "Does the choice of reasoning framework actually matter for test-time performance?"). Routing queries to task-matched knowledge structures, by contrast, *does* help (see "Can routing queries to task-matched structures improve RAG reasoning?").
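To see why compute and reward quality can dominate the choice of framework, it helps to look at how little machinery Best-of-N needs. The sketch below is a generic illustration, not the setup analyzed in the cited note: `best_of_n`, `toy_sample`, and `toy_reward` are hypothetical names, and the "model" is a random digit generator standing in for an LLM.

```python
import random
from typing import Callable


def best_of_n(sample: Callable[[], str], reward: Callable[[str], float], n: int) -> str:
    """Draw n candidates and return the one the reward function scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample() for _ in range(n)]
    # max() keeps the first candidate among ties.
    return max(candidates, key=reward)


# Toy stand-ins (hypothetical): the "model" guesses a digit and the "reward"
# prefers guesses close to 7.
rng = random.Random(0)
toy_sample = lambda: str(rng.randint(0, 9))
toy_reward = lambda answer: -abs(int(answer) - 7)

for n in (1, 4, 16):
    print(n, best_of_n(toy_sample, toy_reward, n))
```

The only two knobs are `n`, which sets the compute budget, and `reward`, which decides what counts as best. If the reward function is unreliable, a larger `n` just selects more confidently for the wrong thing, which is one way to read the claim that reward reliability matters more than the specific search algorithm.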

The thing you didn't know you wanted to know: "when vs. how" isn't a tidy binary. The strongest reading the corpus offers is that RPT is mostly a *when* (deployment-timing) and *format* mechanism operating on capabilities pretraining already installed — but curriculum order shows a real "how" phase still has to come first for the "when" to mean anything.

Sources (12 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

The Invisible Leash: Why RLVR May Not Escape Its Origin (4.26 match)
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (3.57 match)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (3.57 match)
When More is Less: Understanding Chain-of-Thought Length in LLMs (3.52 match)
Spurious Rewards: Rethinking Training Signals in RLVR (3.44 match)
Eliciting Reasoning in Language Models with Cognitive Tools (3.42 match)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (3.35 match)
Hierarchical Reasoning Model (2.64 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing whether RL post-training (RPT) teaches models *when* to deploy reasoning versus *how* to reason. A curated library (spanning 2023–2025, arXiv focus) supplies the dated findings below.

**What a curated library found — and when (dated claims, not current truth):**
- Base models embed reasoning strategies latently; RL optimizes *deployment timing*, not reasoning invention. Hybrid models recover 91% of gains via token routing alone (2025).
- Reward-based training improves sampling efficiency within pre-existing capability bounds; spurious rewards work nearly as well as correct ones when pretraining is sufficient (2025).
- Behavioral activation and benchmark improvement are separable; on contamination-free benchmarks, only correct rewards help, exposing dataset leakage in apparent "reasoning" (2025).
- Reasoning gains reflect *format* organization, not knowledge: 1.5B LoRA-only models match full-RL baselines, and logically invalid CoT performs near-identically to valid CoT (2025).
- An SRL-then-RLVR curriculum shows that an imitation-first phase establishes reasoning foundations and makes the reward phase informative, a "how" contribution that pure deployment views underrate (2024).

**Anchor papers (verify; mind their dates):**
- arXiv:2506.02878 (2025-06): CoT as constrained imitation, not genuine reasoning.
- arXiv:2507.10532 (2025-07): RLVR gains partly memorization; clean benchmarks disambiguate.
- arXiv:2504.15777 (2025-04): LoRA-only reasoning competitive with full RL.
- arXiv:2402.05808 (2024-02): Reverse curriculum learning (SRL→RLVR) outperforms either alone.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, check whether newer training methods (e.g., constitutional AI, outcome-supervised RL), model scales, or evaluation harnesses have relaxed the "when ≫ how" claim. Separate the durable question (does RPT invent reasoning or surface it?) from perishable constraints (e.g., does 91% routing recovery still hold at 70B+ scale?). Cite what changed it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months** that argues RPT genuinely teaches new reasoning strategies rather than timing or format.
(3) **Propose 2 research questions** that assume the regime may have shifted—e.g., whether multi-stage training or ensemble methods now blur the when/how boundary, or whether reasoning gains at scale behave differently than at 1.5B.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
