INQUIRING LINE

Why does pre-training provide the raw material for emergent thinking?

This explores why the capacity for reasoning seems to be laid down during pre-training itself — so that later training (RL, fine-tuning) mostly unlocks and steers what's already there rather than creating it from scratch.


The corpus converges on a striking claim: base models already contain the ingredients for thinking, and post-training is more a process of selection than of creation. One line of work finds that five independent methods (RL steering, critique fine-tuning, decoding tweaks, feature steering, and RLVR) all elicit reasoning that was already latent in base model activations (Do base models already contain hidden reasoning ability?). The bottleneck, in other words, is eliciting the ability, not acquiring it.

If pre-training is where the raw material lives, what exactly is that material made of? The most concrete answer comes from an analysis of five million pre-training documents: reasoning draws on broad, transferable *procedural* knowledge, the how-to patterns scattered across many sources, rather than the narrow, document-specific memorization that factual recall depends on (Does procedural knowledge drive reasoning more than factual retrieval?). Pre-training on diverse text accidentally absorbs the *moves* of reasoning, not just the facts, and those moves generalize. That's the substrate later methods tap into.

The complement to this is a sharp reframing of what RL post-training actually does. Several notes argue it teaches a model *when* to reason, not *how*: hybrid models recover 91% of the gains just by routing tokens, and the activation vectors for reasoning strategies exist before any RL touches them (Does RL post-training create reasoning or just deploy it?). A formal version frames thinking as selecting among sub-policies that a richly initialized model already holds (Does thinking emerge when agents choose between learned sub-policies?). Even training-free methods make the point: modular cognitive tools lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% with no RL at all, simply by isolating operations the model could already perform (Can modular cognitive tools unlock reasoning without training?).

But here's the wrinkle that keeps this from being a tidy story: the picture is domain-conditional. For standard reasoning, RL activates what's latent; for complex multi-step planning, it appears to generate genuinely novel strategies the base model can't reach even with heavy sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). So pre-training provides raw material *up to a point*; deep planning may need capability creation, not just elicitation. And the same latent mechanism can cut both ways: vanilla models often use extended thinking counterproductively, inducing self-doubt, until RL redirects that same mechanism into productive analysis (Does extended thinking help or hurt model reasoning?).
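One way to make "can't reach even with heavy sampling" measurable is the pass@k metric: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k samples succeeds. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k). The notes summarized here do not say which metric the underlying work used, so pass@k is an assumption, and the function name and sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct).

    n: total samples drawn for a problem
    c: how many of those n samples were correct
    k: sampling budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that all k chosen samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: a latent ability shows up as rare successes that
# become near-certain with a larger budget.
rare_but_present = pass_at_k(n=256, c=4, k=64)

# An unreachable strategy never succeeds, so no budget helps.
never_reached = pass_at_k(n=256, c=0, k=64)

print(rare_but_present, never_reached)
```

Under this reading, activation would look like a base model whose pass@k climbs toward the RL-trained model's as k grows, while creation would look like problems where the base model's estimate stays at zero however large k gets.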

The forward-looking thread asks whether we can stop relying on accident and plant reasoning into pre-training deliberately. RLP treats chain-of-thought as an exploratory action *during* pre-training, rewarding each thought by how much it improves next-token prediction, and lifts reasoning performance by roughly 19% (Can chain-of-thought reasoning be learned during pretraining itself?). A parallel approach augments pre-training data with generated reasoning traces, getting 3x data efficiency by spending more 'thinking' on harder tokens (Can training data augmentation match test-time compute scaling benefits?). The thing you didn't know you wanted to know: the field is shifting from treating reasoning as something you *bolt on afterward* to something you can *seed in the foundation*, and the better pre-training gets at laying that groundwork, the less work post-training has to do.
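The RLP reward mentioned above, a thought scored by how much it improves next-token prediction, can be illustrated with a few lines of arithmetic. This is a minimal sketch and not the paper's implementation: the function name and the probabilities are hypothetical, and details such as how the no-thought baseline is produced are not covered by the notes summarized here.

```python
import math


def thought_reward(logp_with_thought: float, logp_without_thought: float) -> float:
    """Log-likelihood gain on the observed next token from conditioning on a thought.

    Positive when the sampled chain-of-thought made the true token more likely,
    negative when it made the true token less likely.
    """
    return logp_with_thought - logp_without_thought


# Hypothetical probabilities the model assigns to the true next token.
p_without_thought = 0.10  # predicting directly from the context
p_with_thought = 0.25     # predicting after first generating a thought

reward = thought_reward(math.log(p_with_thought), math.log(p_without_thought))
print(f"reward = {reward:.3f} nats")
```

Because the signal comes from the training text itself, no external verifier or labelled answer is needed, which is what lets a reward of this kind be applied to ordinary pre-training data.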


Sources (9 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL is applied.

Does thinking emerge when agents choose between learned sub-policies?

Research formalizes thinking as selecting between sub-policies already contained in a policy function through a thought MDP framework. The key finding: thinking doesn't require new reasoning capabilities but rather rich policy initialization combined with RL-driven selection pressure.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The durable question: does pre-training encode reasoning as latent sub-policies that post-training merely selects/routes, or does post-training create genuinely novel reasoning strategies? A curated library (spanning 2024–03 through 2026–05) found — and these are dated claims, not current truth:

• Five independent elicitation methods (RL steering, critique fine-tuning, decoding, feature steering, RLVR) recover latent reasoning already in base activations; post-training appears to be selection, not creation (~2025).
• Procedural knowledge (how-to patterns across diverse documents) is the substrate for reasoning generalization, not factual memorization; it is absorbed by accident during pre-training (~2024–11).
• RL post-training teaches *when* to reason, not *how*; hybrid routing models recover 91% of gains without learning reasoning mechanics (~2025).
• Domain-conditional boundary: standard reasoning is latent-activation; multi-step planning may require genuine capability creation beyond base model reach (~2025).
• Vanilla extended thinking often induces self-doubt; RL redirects that same mechanism productively (~2025).
• RLP (chain-of-thought as exploratory action during pre-training, rewarded by next-token prediction gain) lifts reasoning ~19% (~2025–09).
• Thinking-augmented pre-training achieves 3x data efficiency vs. standard pre-training (~2025–09).

Anchor papers (verify; mind their dates):
- arXiv:2411.12580 (2024–11) Procedural Knowledge in Pretraining
- arXiv:2506.17124 (2025–06) Is RL Enough for Thinking?
- arXiv:2510.01265 (2025–09) RLP: Reinforcement as Pretraining Objective
- arXiv:2509.20186 (2025–09) Thinking Augmented Pre-training

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, methods (SFT, IFT, curriculum pre-training), tooling (inference harnesses, sampling strategies), or evals have since RELAXED or OVERTURNED it. Separate the durable question (likely still open) from the perishable claim (possibly resolved); cite what resolved it, and state plainly where the latent-elicitation vs. creation distinction still appears to hold. Pay special attention to the domain-conditional boundary: has the field narrowed or expanded where RL creates vs. elicits?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers arguing that pre-training does *not* contain reasoning, or that post-training is not selection but genuine capability injection.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., if RLP and thinking-augmented pre-training now routinely embed reasoning at foundation, what *post-training* capability gaps remain? Or: if the field has moved toward seeding reasoning in pre-training, what is the new bottleneck for planning, abstraction, or meta-reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
