How does RPT compare to learning when versus how to deploy reasoning?
This explores a recent claim about RL post-training (RPT): that it teaches models *when* to deploy reasoning they already have, rather than teaching them *how* to reason in the first place — and what the corpus says for and against that split.
Read literally, the question lands on one of the most interesting reframings in the collection: that RPT isn't creating reasoning ability at all; it's learning *when* to switch it on. The clearest statement of this is the finding that base models already carry reasoning strategies in latent form, and RL mostly optimizes deployment timing. Hybrid models recover 91% of the gains by routing tokens alone, and the activation vectors for reasoning strategies exist *before* any RL touches the model (Does RL post-training create reasoning or just deploy it?). So the comparison the question asks for, RPT vs. "learning when vs. how", is really the same debate viewed twice: the deployment view says RPT is a *when* mechanism, not a *how* mechanism.
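One way to picture the "routing tokens" idea is a decoder that asks a gate, at every position, whether a reasoning behavior should be switched on for the next token. The sketch below is a toy under stated assumptions: `hybrid_decode`, `base_step`, `steered_step`, and `gate` are hypothetical names, the token strings are made up, and nothing here is the method or API of the cited note. It only separates the *when* decision (the gate) from the *how* (the step functions).

```python
from typing import Callable, List, Tuple

Tokens = List[str]


def hybrid_decode(
    prompt: Tokens,
    base_step: Callable[[Tokens], str],
    steered_step: Callable[[Tokens], str],
    gate: Callable[[Tokens], bool],
    max_new_tokens: int,
) -> Tuple[Tokens, List[int]]:
    """Generate tokens, consulting `gate` at every position to decide
    whether the reasoning behavior is switched on for that token."""
    tokens = list(prompt)
    steered_at: List[int] = []  # positions where the gate fired
    for _ in range(max_new_tokens):
        if gate(tokens):
            steered_at.append(len(tokens))
            tokens.append(steered_step(tokens))
        else:
            tokens.append(base_step(tokens))
    return tokens, steered_at


# Toy stand-ins (all hypothetical); real models would sample tokens.
def base_step(tokens: Tokens) -> str:
    return "compute"


def steered_step(tokens: Tokens) -> str:
    return "verify"


def gate(tokens: Tokens) -> bool:
    # Assumption for illustration: fire at every fourth position.
    return len(tokens) % 4 == 0


out, steered_at = hybrid_decode(["question"], base_step, steered_step, gate, 8)
print(out)
print("gate fired at positions:", steered_at)
```

In this toy, swapping in a different `gate` changes which positions get the reasoning behavior without touching either step function, which is the sense in which timing and capability are separable.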
Several notes converge on this from different angles. One shows that reinforcement learning with verifiable rewards (RLVR) improves sampling efficiency *within* a model's existing capability boundary without expanding it: a single example can suffice to activate behavior, and even spurious or random rewards work nearly as well as correct ones if pretraining already laid the groundwork (What does reward learning actually do to model reasoning?). Another finds the activation can be genuine even when the benchmark gains are partly memorization: "behavioral activation" and "benchmark improvement" turn out to be separable phenomena measured at different levels (Can genuine reasoning activation coexist with contaminated benchmarks?). A third finds that on clean, uncontaminated benchmarks only correct rewards help, exposing how much apparent "reasoning" was dataset leakage (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). All of this supports the "when, not how" picture: training is surfacing and timing pre-existing capability, not minting new reasoning.
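The difference between better sampling efficiency and a wider capability boundary can be made concrete with a small calculation. The sketch below is an illustration only: the per-problem success rates are made up, and `sharpen` is a hypothetical stand-in for reward training that assumes one tuned sample is worth `gain` base samples. It uses the standard identity that, with independent samples, the chance of at least one success in k tries is 1 - (1 - p)^k.

```python
from typing import List


def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples solves a problem
    whose per-sample success probability is p."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(probs: List[float], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in probs) / len(probs)


def sharpen(p: float, gain: int = 4) -> float:
    """Hypothetical effect of reward training: one tuned sample is as good
    as `gain` base samples. Maps 0 to 0, so problems the base model can
    never solve stay unsolved."""
    return pass_at_k(p, gain)


# Made-up per-problem success rates for a base model (assumption, not data).
base = [0.0, 0.0, 0.05, 0.1, 0.3, 0.6]
tuned = [sharpen(p) for p in base]

solvable_base = {i for i, p in enumerate(base) if p > 0}
solvable_tuned = {i for i, p in enumerate(tuned) if p > 0}

for k in (1, 16, 256):
    print(k, round(mean_pass_at_k(base, k), 3), round(mean_pass_at_k(tuned, k), 3))
print("same solvable set:", solvable_base == solvable_tuned)
```

Under these assumptions the tuned model wins clearly at k = 1, but both models approach the same ceiling as k grows, because the set of problems with nonzero success probability never changes.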
The sharpest lateral support comes from work showing reasoning gains are about *format*, not knowledge. A 1.5B model with LoRA-only post-training matched larger full-parameter RL models, implying RL teaches output *organization* rather than new facts; reasoning and knowledge storage are separable (Can small models reason well by just learning output format?). The chain-of-thought (CoT) critiques push the same point further: logically *invalid* CoT exemplars perform nearly as well as valid ones (Does logical validity actually drive chain-of-thought gains?), CoT degrades predictably outside its training distribution (Does chain-of-thought reasoning actually generalize beyond training data?), and what looks like inference is better described as constrained imitation of reasoning *form* (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Why does chain-of-thought reasoning fail in predictable ways?). If models are reproducing the shape of reasoning rather than the substance, then "how to reason" was never what training was teaching, which is exactly the deployment thesis.
But the corpus doesn't let the "when, not how" story win cleanly. The curriculum that runs Supervised RL (SRL) first and RLVR second shows that an imitation phase establishing reasoning foundations makes the later reward phase informative, and the combination beats either alone (Does sequencing imitation then exploration training improve reasoning?). That's a "how" contribution that a pure deployment view underrates: you sometimes have to build the rollouts before timing can be sharpened. And once reasoning is deployed, *how* you spend the budget matters less than people think: framework choice (Best-of-N, BoN, vs. Monte Carlo tree search, MCTS) washes out once you control for total compute and reward-function quality (Does the choice of reasoning framework actually matter for test-time performance?), while routing queries to task-matched knowledge structures *does* help (Can routing queries to task-matched structures improve RAG reasoning?).
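The dependence on reward-function quality is easy to see in the simplest test-time framework, Best-of-N. The sketch below is a minimal illustration, not the information-theoretic analysis the note describes, and it does not implement MCTS. The candidate answers and both reward functions are made up for the example.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward score."""
    return max(candidates, key=reward)


# Hypothetical sampled answers to "What is 6 * 7?" (made up for illustration).
candidates = ["41", "42", "The answer is definitely 40, no doubt about it."]


def exact_match_reward(answer: str) -> float:
    """Reliable verifier: 1.0 only for the correct answer."""
    return 1.0 if answer.strip() == "42" else 0.0


def length_reward(answer: str) -> float:
    """Unreliable proxy: longer answers score higher."""
    return float(len(answer))


picked_reliable = best_of_n(candidates, exact_match_reward)
picked_unreliable = best_of_n(candidates, length_reward)
print("reliable reward picks:", picked_reliable)
print("unreliable reward picks:", picked_unreliable)
```

The selection procedure is identical in both calls; only the reward changes, and with it the answer that gets chosen.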
The thing you didn't know you wanted to know: "when vs. how" isn't a tidy binary. The strongest reading the corpus offers is that RPT is mostly a *when* (deployment-timing) and *format* mechanism operating on capabilities pretraining already installed — but curriculum order shows a real "how" phase still has to come first for the "when" to mean anything.
Sources (12 notes)
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Running Supervised RL (SRL) first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms either method in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.
StructRAG demonstrates that selecting the knowledge structure type based on query demands (via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks) improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load theory and cognitive fit theory.