INQUIRING LINE

What makes reasoning capability a pre-training rather than post-training phenomenon?

This explores why the raw machinery of reasoning seems to be laid down while a model is learning to predict text (pre-training), with later RL or fine-tuning mostly deciding when and how to switch it on rather than building it from scratch.


Reasoning looks like something a model *already has* coming out of pre-training, with post-training acting more as a thermostat than a foundry. The corpus makes a surprisingly consistent case. The strongest version: base models already carry latent reasoning circuitry, and five unrelated techniques (RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, and RLVR) all manage to elicit it from activations that were there before any post-training touched the weights (see: Do base models already contain hidden reasoning ability?). If five different keys open the same door, the door was already built. Post-training selects; it doesn't create.
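One way to picture "eliciting from activations" is activation steering: nudging a hidden state along a direction that already exists in the model, without touching any weights. The sketch below is a toy illustration only. The vectors, the `steer` and `projection` helpers, and the `alpha` strength are all hypothetical, and the notes do not say how any of the five techniques is actually implemented.

```python
from math import sqrt

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def projection(hidden, direction):
    """Scalar projection of a hidden state onto a (non-zero) direction."""
    return dot(hidden, direction) / sqrt(dot(direction, direction))

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction); no weights are modified."""
    norm = sqrt(dot(direction, direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Hypothetical 3-d hidden state and "reasoning" direction (assumptions).
hidden = [1.0, 2.0, 0.0]
reasoning_direction = [0.0, 0.0, 2.0]

steered = steer(hidden, reasoning_direction, alpha=3.0)
before = projection(hidden, reasoning_direction)
after = projection(steered, reasoning_direction)
```

The point of the toy is that the direction is a fixed property of the space; steering only changes how strongly a given state expresses it.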

The sharpest reframing is that RL teaches *when* to reason, not *how*. Activation vectors for reasoning strategies exist before RL runs, and a hybrid model can recover 91% of the performance gains just by routing tokens, that is, by deciding when to deploy thinking rather than learning new thinking (see: Does RL post-training create reasoning or just deploy it?). The same theme shows up from the opposite direction: vanilla models already have the thinking mechanism but use it counterproductively, drowning in self-doubt, and RL's job is to redirect that existing machinery toward productive gap analysis (see: Does extended thinking help or hurt model reasoning?).
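To make "recover 91% of the performance gains" concrete, here is a minimal sketch of one plausible reading: the share of the base-to-RL improvement that the routing-only hybrid achieves. The formula is an assumption about how the figure is defined, and the benchmark scores are made-up placeholders, not numbers from the cited work.

```python
def gain_recovered(base_score: float, hybrid_score: float, rl_score: float) -> float:
    """Fraction of the (rl - base) improvement that the hybrid model reaches.

    Assumed definition: (hybrid - base) / (rl - base).
    """
    full_gain = rl_score - base_score
    if full_gain <= 0:
        raise ValueError("RL model must outperform the base model")
    return (hybrid_score - base_score) / full_gain

# Hypothetical accuracy figures, chosen only to illustrate the arithmetic.
base, hybrid, rl = 40.0, 58.2, 60.0
fraction = gain_recovered(base, hybrid, rl)  # 18.2 / 20.0
```

Under this reading, a routing-only model that closes 18.2 of a 20-point gap has recovered 91% of what RL added, which is why the result is framed as scheduling rather than new capability.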

Why does the capability live in pre-training at all? The most concrete answer is about what the model reads. Analysis of five million pre-training documents shows reasoning draws on broad, transferable *procedural* knowledge (the how-to patterns) spread across many sources, while factual recall depends on narrow memorization of specific documents (see: Does procedural knowledge drive reasoning more than factual retrieval?). Reasoning is a generalization soaked up from seeing many worked examples, not a fact you can point to. There's even an architectural correlate: knowledge retrieval sits in lower network layers and reasoning adjustment in higher ones, which is why reasoning training can boost math yet degrade knowledge-intensive domains like medicine (see: Why does reasoning training help math but hurt medical tasks?). The substrate is structural, and it's already in place.

If the capability is pre-trained, the natural follow-up is: can we plant *more* of it earlier? RLP treats chain-of-thought as an exploratory action *during* pre-training, rewarding it by how much it improves next-token prediction, and lifts reasoning benchmarks by roughly 19%, which is evidence that reasoning can be deliberately seeded at the pre-training stage rather than bolted on later (see: Can chain-of-thought reasoning be learned during pretraining itself?). Training models on backward reasoning to sharpen forward reasoning points the same way: deepening understanding at training time, with no test-time cost (see: Can backward reasoning during training improve forward reasoning?). And the elicitation framing has a striking implication: modular cognitive tools pushed GPT-4.1 on AIME 2024 from 26.7% to 43.3% with *no* RL training at all, purely by structuring access to what was already there (see: Can modular cognitive tools unlock reasoning without training?).
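The RLP idea of rewarding a thought "by how much it improves next-token prediction" can be sketched as a difference of log-likelihoods. This is a minimal illustration under stated assumptions: the function name `thought_reward` is hypothetical, the probabilities are placeholders, and the choice of a plain no-thought baseline is a simplification that may differ from the actual method.

```python
from math import log

def thought_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the true next token.

    p_with_thought:    probability of the true next token after a sampled thought
    p_without_thought: probability of the same token from the context alone
    Needs no external verifier: the training text itself supplies the signal.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    return log(p_with_thought) - log(p_without_thought)

# Placeholder probabilities (assumptions, not measured values).
helpful = thought_reward(0.6, 0.3)   # thought doubled the probability
neutral = thought_reward(0.3, 0.3)   # thought changed nothing
harmful = thought_reward(0.1, 0.3)   # thought made prediction worse
```

A thought earns positive reward only when it makes the real continuation more likely, so the signal is available on ordinary pre-training text rather than on curated problems with checkable answers.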

But the corpus doesn't let pre-training take all the credit, and that's the part worth knowing. The cleanest counterexample: RL's role is domain-conditional. For standard reasoning, RL activates latent ability, consistent with the pre-training story. But for complex multi-step planning, RL generates genuinely novel strategies the base model can't reach even with massive sampling (see: Does reinforcement learning create new reasoning abilities or activate existing ones?). So the honest synthesis isn't "reasoning is purely pre-trained." It is that the *retrievable, single-shot* kind of reasoning is pre-trained and merely elicited, while deep planning may still require post-training to build something new. The shadow over all of it: chain-of-thought degrades predictably outside its training distribution, imitating the form of reasoning without the logic (see: Does chain-of-thought reasoning actually generalize beyond training data?), which fits a capability that was *absorbed* from pre-training data rather than reasoned into existence.
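A common way to test whether a strategy is reachable "even with massive sampling" is pass@k: the chance that at least one of k samples solves the problem. The sketch below uses the standard unbiased estimator from n samples with c correct. The notes do not say which metric the cited work used, so treating pass@k as the measure is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn, c: number of correct samples, k: sampling budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Latent ability: rare successes become likely as the budget grows.
rare_small_budget = pass_at_k(n=100, c=1, k=1)
rare_large_budget = pass_at_k(n=100, c=1, k=50)

# Unreachable strategy: zero successes stay at zero for any budget.
never = pass_at_k(n=100, c=0, k=50)
```

The distinction drawn in the paragraph maps onto these two cases: elicitation moves a rare success into the common case, whereas a genuinely novel strategy is one where the base model's estimate stays at zero however large k gets.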


Sources (10 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can backward reasoning during training improve forward reasoning?

Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by an average of 13.53% across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability researcher re-testing dated claims about whether reasoning lives in pre-training or emerges from post-training. The question remains open: *what is the actual boundary between latent capability and learned deployment?*

What a curated library found — and when (dated claims, not current truth):
Findings span March 2024–December 2025. The library argues:

• Five unrelated techniques (RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, RLVR) all elicit reasoning already present in base-model activations, suggesting latent reasoning circuitry pre-exists (~2024–2025).
• Hybrid models recover 91% of RL gains by routing tokens (deciding *when* to reason) rather than learning *how* to reason, reframing RL as deployment scheduling (~2024–2025).
• Analysis of five million pre-training documents shows reasoning draws on *procedural* knowledge spread across many sources, while factual recall depends on narrow memorization. Separately, knowledge retrieval sits in lower network layers and reasoning adjustment in higher ones (~2024–11).
• Chain-of-thought as an exploratory action *during* pre-training (RLP) lifts reasoning ~19%, evidence reasoning can be seeded earlier (~2025–09).
• RL's role is domain-conditional: standard reasoning activates latent ability; multi-step planning generates novel strategies base models cannot reach (~2024–10).
• Chain-of-thought degrades predictably outside training distribution, imitating reasoning form without logic (~2025–08).

Anchor papers (verify; mind their dates):
• arXiv:2411.12580 (Nov 2024) — Procedural Knowledge in Pretraining
• arXiv:2510.01265 (Sep 2025) — RLP: Reinforcement as Pretraining Objective
• arXiv:2508.01191 (Aug 2025) — Chain-of-Thought as Mirage (distribution lens)
• arXiv:2512.07783 (Dec 2025) — Interplay of Pre-, Mid-, RL on Reasoning

Your task:

(1) RE-TEST EACH CONSTRAINT. For the 91% routing claim, does it hold under recent scaling (GPT-4o, o1-class models)? For procedural/factual layer separation, has mechanistic interpretability since refined the picture, or do newer architectures (e.g., mixture-of-experts, sparse transformers) blur the boundary? For domain-conditional RL, which planning benchmarks have shifted from latent-ability extraction to genuine generation in the last 6 months? Separate what still appears true (e.g., distribution-boundedness of CoT) from what newer methods may have relaxed (e.g., test-time scaling with extended reasoning traces).

(2) Surface the strongest *contradicting or superseding* work from the last ~6 months—particularly anything arguing reasoning emerges *purely* from scaling or from mid-training objectives, not pre-training.

(3) Propose 2 research questions that assume the regime may have moved: (a) If reasoning *is* pre-trained, what post-training objective actually deepens it rather than merely scheduling it? (b) How do test-time compute budgets and chain-of-thought length interact with the latent/learned boundary?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
