SYNTHESIS NOTE

Can chain-of-thought reasoning be learned during pretraining itself?

Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

The dominant paradigm separates pretraining (next-token prediction) from reasoning (RL post-training with verifiable rewards). RLP, a reinforcement pretraining objective, challenges this split by bringing RL's core mechanism, exploration, into pretraining itself. The key idea is to treat chain-of-thought as an exploratory action taken before predicting each next token, with the reward computed from the information gain that thought provides for the prediction.

The reward signal is elegant: measure the increase in log-likelihood of the observed token when conditioning on both the context and a sampled reasoning chain, compared with conditioning on the context alone. This signal is verifier-free (no task-specific checkers needed), dense (it assigns credit at every position), and applicable to ordinary web-scale text during pretraining. The model learns to think for itself before predicting what comes next, so independent thinking behavior is taught earlier in training.
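The reward described above can be sketched in a few lines. This is a minimal illustration, not the RLP implementation: the function names and the toy probabilities are assumptions made for this example, and a real setup would obtain both log-likelihoods from language-model forward passes rather than from hard-coded numbers.

```python
import math

def information_gain_reward(logp_with_thought: float, logp_without_thought: float) -> float:
    """Reward for one position: how much the sampled thought raised the
    log-likelihood of the token that was actually observed."""
    return logp_with_thought - logp_without_thought

def position_rewards(p_with_thought, p_without_thought):
    """Dense rewards: one value per token position in the document.

    Both arguments are lists of probabilities assigned to the observed
    token at each position (hypothetical inputs for this sketch)."""
    return [
        information_gain_reward(math.log(a), math.log(b))
        for a, b in zip(p_with_thought, p_without_thought)
    ]

# Toy numbers (assumed, not taken from any experiment):
# position 0: the thought doubles the probability of the observed token
# position 1: the thought changes nothing
# position 2: the thought slightly hurts the prediction
rewards = position_rewards([0.5, 0.2, 0.90], [0.25, 0.2, 0.95])
```

A positive reward means the thought helped predict the observed token, zero means it was irrelevant, and a negative value means it hurt. No external verifier is involved: the observed next token is the only supervision.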

The results are strong: pretraining with RLP on Qwen3-1.7B lifts the average across eight math-and-science benchmarks by 19%, and with identical post-training the gains compound further. Applied to Nemotron-Nano-12B, the overall average increases from 42.81% to 61.32%. The largest improvements are on reasoning-heavy tasks such as AIME25 and MMLU-Pro.
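For the Nemotron-Nano-12B figures quoted above, it helps to separate the absolute gain (in percentage points) from the relative gain (in percent). The helper names below are assumptions for this illustration; only the two averages come from the note.

```python
def absolute_gain_points(before: float, after: float) -> float:
    """Gain in percentage points between two benchmark averages."""
    return after - before

def relative_gain_percent(before: float, after: float) -> float:
    """Gain expressed as a percentage of the starting average."""
    return 100.0 * (after - before) / before

before, after = 42.81, 61.32  # overall averages quoted in the note

points = round(absolute_gain_points(before, after), 2)     # 18.51 points
relative = round(relative_gain_percent(before, after), 1)  # 43.2 percent
```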

This is significant because it reframes when reasoning should be learned. If base models already contain latent reasoning ability (see "Do base models already contain hidden reasoning ability?"), RLP suggests that pretraining itself can plant stronger reasoning seeds. And if RL mainly teaches models when to use reasoning rather than how (see "Does RL teach reasoning or just when to use it?"), RLP may teach the "how" during pretraining, leaving post-training to teach the "when", which would be a cleaner division of labor.

Unlike the earlier reinforcement pretraining (RPT) approach, which uses sparse binary rewards and relies on proxy-model filtering, RLP provides a continuous improvement signal at every position and trains on full documents, eliminating the need to preselect high-entropy tokens.
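The contrast above can be made concrete with a toy sketch. Everything here is an assumption for illustration: the entropy threshold, the function names, and the example distributions are hypothetical, and neither method's actual filtering or reward code is reproduced.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def filtered_positions(proxy_dists, threshold):
    """Filtered style: keep only positions that a proxy model finds
    uncertain, i.e. where its entropy exceeds the (assumed) threshold."""
    return [i for i, d in enumerate(proxy_dists) if token_entropy(d) > threshold]

def dense_positions(proxy_dists):
    """Dense style: every position in the document receives a reward."""
    return list(range(len(proxy_dists)))

def binary_reward(predicted_token, observed_token):
    """Sparse binary signal: 1 for an exact match, otherwise 0."""
    return 1.0 if predicted_token == observed_token else 0.0

# Hypothetical proxy-model distributions over a 4-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],  # moderately confident
    [0.40, 0.30, 0.20, 0.10],  # fairly uncertain
]

kept = filtered_positions(dists, threshold=1.0)  # [1, 3]
everywhere = dense_positions(dists)              # [0, 1, 2, 3]
```

With filtering, only the uncertain positions contribute to training, and each contributes a 0-or-1 signal. The dense variant assigns a graded reward at all four positions, so a thought that moves probability in the right direction is credited even when the top prediction is still wrong.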

Inquiring lines that read this note 80

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do base models contain latent reasoning that training can unlock?

Can prompting inject entirely new knowledge into language models?

Can prompting unlock compositional skills that pretraining already learned?

What actually drives chain-of-thought reasoning improvements in language models?

How do training data properties shape reasoning capability development?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Can extended thinking genuinely improve reasoning or just increase variance?

What determines success in training models on multiple tasks?

Does task ordering affect multi-task reinforcement learning outcomes?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does AI adoption affect human skill development and labor equality?

Does constraining AI access during early task phases preserve skill formation?

Does reinforcement learning teach reasoning or just when to reason?

What properties determine whether reward signals teach genuine reasoning?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Does policy entropy collapse limit how many iterations of reasoning training work?

What capability tradeoffs emerge when scaling model reasoning abilities?

What constrains reinforcement learning's ability to expand model reasoning?

How does the pretrained prior set a capability ceiling for reward model exploration?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Can inference-time compute substitute for scaling up model parameters?

Does inference-time compute improve pretraining data efficiency in practice?

Can next-token prediction alone produce genuine language understanding?

How does latent reasoning compare to verbalized chain-of-thought?

How do self-generated feedback mechanisms enable effective model learning?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

How do neural networks separate factual knowledge from reasoning abilities?

Why do knowledge and reasoning train in different network layers?

Can self-supervised signals enable process supervision without human annotation?

How does action-level decomposition differ from token-level imitation in supervision?

How should iterative research systems allocate reasoning per search step?

How does o1-style reasoning relate to learned search processes versus memorized solutions?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does model scale affect anticipatory behavior in structured training?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Related concepts in this collection 4

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
extends: RLP strengthens the latent reasoning during pretraining itself
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
complements: RLP may teach "how" during pretraining, leaving post-training for "when"
Can models learn reasoning from predicting any text? Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
parallels: both generate internal rationales at token level with self-supervised reward, but RLP operates during pretraining
Can adversarial critics replace task-specific verifiers for reasoning? Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
connects: both achieve verifier-free reasoning training but via different mechanisms
