What makes reasoning capability a pre-training rather than post-training phenomenon?
This explores why the raw machinery of reasoning seems to be laid down while a model is learning to predict text (pre-training), with later RL or fine-tuning mostly deciding when and how to switch it on rather than building it from scratch.
Reasoning looks like something a model *already has* coming out of pre-training, with post-training acting more as a thermostat than a foundry. The corpus makes a surprisingly consistent case. The strongest version: base models already carry latent reasoning circuitry, and five unrelated techniques (RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, and RLVR) all manage to elicit it from activations that were sitting there before any post-training touched the weights (Do base models already contain hidden reasoning ability?). If five different keys open the same door, the door was already built. Post-training selects; it doesn't create.
The sharpest reframing is that RL teaches *when* to reason, not *how*. Activation vectors for reasoning strategies exist before RL runs, and a hybrid model can recover 91% of the performance gains just by routing tokens, that is, by deciding when to deploy thinking rather than learning new thinking (Does RL post-training create reasoning or just deploy it?). The same theme shows up from the opposite direction: vanilla models already have the thinking mechanism but use it counterproductively, drowning in self-doubt, and RL's job is to redirect that existing machinery toward productive gap analysis (Does extended thinking help or hurt model reasoning?).
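To make the "routing, not relearning" idea concrete, here is a minimal Python sketch under stated assumptions. It is not the cited work's implementation: the function names (`apply_steering`, `gap_recovered`), the additive form of the steering, and every number are hypothetical illustrations. It shows a pre-existing activation vector being added only at routed token positions, and one plausible way a "fraction of gains recovered" figure such as 91% could be computed.

```python
from typing import List, Sequence

Vector = List[float]


def apply_steering(
    hidden_states: Sequence[Sequence[float]],
    direction: Sequence[float],
    route_mask: Sequence[bool],
    strength: float = 1.0,
) -> List[Vector]:
    """Add a fixed steering vector only at positions the router selects.

    Assumption: steering is a simple additive shift of the hidden state,
    and the router is a precomputed boolean mask with one entry per token.
    """
    if len(hidden_states) != len(route_mask):
        raise ValueError("need one routing decision per token position")
    steered = []
    for hidden, use_reasoning in zip(hidden_states, route_mask):
        if use_reasoning:
            steered.append([h + strength * d for h, d in zip(hidden, direction)])
        else:
            steered.append(list(hidden))  # base activations pass through unchanged
    return steered


def gap_recovered(base_score: float, hybrid_score: float, rl_score: float) -> float:
    """Fraction of the base-to-RL improvement that the hybrid model reaches."""
    if rl_score == base_score:
        raise ValueError("no gain to recover: base and RL scores are equal")
    return (hybrid_score - base_score) / (rl_score - base_score)


# Illustrative numbers only, not results from the cited work.
states = [[1.0, 0.0], [0.0, 1.0]]
steered = apply_steering(states, direction=[0.5, 0.5], route_mask=[False, True])
print(steered)  # [[1.0, 0.0], [0.5, 1.5]]
print(round(gap_recovered(40.0, 58.2, 60.0), 2))  # 0.91
```

The point of the design is that nothing in `apply_steering` is learned: the direction is assumed to exist already, and the only decision left is which positions receive it.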
Why does the capability live in pre-training at all? The most concrete answer is about what the model reads. Analysis of five million pre-training documents shows that reasoning draws on broad, transferable *procedural* knowledge (the how-to patterns) spread across many sources, while factual recall depends on narrow memorization of specific documents (Does procedural knowledge drive reasoning more than factual retrieval?). Reasoning is a generalization soaked up from seeing many worked examples, not a fact you can point to. There's even an architectural correlate: knowledge retrieval operates in lower network layers and reasoning adjustment in higher ones, which is why reasoning training can boost math yet degrade knowledge-intensive domains like medicine (Why does reasoning training help math but hurt medical tasks?). The substrate is structural, and it's already in place.
If the capability is pre-trained, the natural follow-up is: can we plant *more* of it earlier? RLP treats chain-of-thought as an exploratory action *during* pre-training, rewarding it by how much it improves next-token prediction, and lifts reasoning benchmarks by roughly 19%. That is evidence that reasoning can be deliberately seeded at the pre-training stage rather than bolted on later (Can chain-of-thought reasoning be learned during pretraining itself?). Training models on backward reasoning to sharpen forward reasoning points the same way: deepening understanding at training time, with no test-time cost (Can backward reasoning during training improve forward reasoning?). And the elicitation framing has a striking implication: modular cognitive tools pushed GPT-4.1 on AIME 2024 from 26.7% to 43.3% with *no* RL training at all, purely by structuring access to what was already there (Can modular cognitive tools unlock reasoning without training?).
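The RLP reward described above, "how much a thought improves next-token prediction", can be sketched as a difference of log-likelihoods. This is a minimal illustration, not the method's actual code: the function name `thought_reward` is hypothetical, and using the model's own no-thought prediction as the baseline is an assumption, since the notes here say only that the reward is a verifier-free log-likelihood improvement.

```python
import math


def thought_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood gain on the true next token from conditioning on a thought.

    p_with_thought:    probability of the observed next token given context + thought
    p_without_thought: probability of the same token given the context alone
                       (assumed baseline; the real method may use a different one)
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


# Illustrative probabilities only.
helpful = thought_reward(0.50, 0.25)   # thought doubled the token's probability
harmful = thought_reward(0.20, 0.40)   # thought halved it
neutral = thought_reward(0.30, 0.30)   # thought changed nothing

print(helpful > 0, harmful < 0, neutral == 0.0)  # True True True
```

Because the signal comes from ordinary text rather than a checker, a reward of this shape needs no verifier: a thought is credited only when it makes the observed continuation more likely.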
But the corpus doesn't let pre-training take all the credit, and that's the part worth knowing. The cleanest counterexample: RL's role is domain-conditional. For standard reasoning, RL activates latent ability, consistent with the pre-training story. But for complex multi-step planning, RL generates genuinely novel strategies the base model can't reach even with massive sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). So the honest synthesis isn't "reasoning is purely pre-trained"; it's that the *retrievable, single-shot* kind of reasoning is pre-trained and merely elicited, while deep planning may still require post-training to build something new. The shadow over all of it: chain-of-thought degrades predictably outside its training distribution, imitating the form of reasoning without the logic (Does chain-of-thought reasoning actually generalize beyond training data?). That fits a capability that was *absorbed* from pre-training data rather than reasoned into existence.
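Whether a base model "can reach" a strategy with massive sampling is commonly quantified with pass@k: the probability that at least one of k sampled attempts succeeds. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples with c successes. That the cited work measured reachability this way is an assumption; the note says only "extensive sampling", and the sample counts below are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one success in k draws).

    n: total samples drawn for a problem
    c: number of those samples that were correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts, for illustration only.
print(pass_at_k(n=4, c=1, k=2))        # 0.5
print(pass_at_k(n=1000, c=0, k=256))   # 0.0
```

Under this reading, "latent but elicitable" corresponds to a base model whose pass@k climbs toward the RL model's score as k grows, while "genuinely novel" corresponds to the second case: zero successes, so no sampling budget closes the gap.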
Sources (10 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by 13.53% average across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.