Can chain-of-thought reasoning be learned during pretraining itself?
Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.
The dominant paradigm separates pretraining (next-token prediction) from reasoning (RL post-training with verifiable rewards). RLP challenges this by bringing RL's core mechanism — exploration — into pretraining itself. The key idea: treat chain-of-thought as an exploratory action taken before predicting each next token, with reward computed from the information gain that thought provides.
The reward signal is elegant: measure the increase in log-likelihood of the observed token when conditioning on both context and a sampled reasoning chain, compared to context alone. This is verifier-free (no task-specific checkers needed), dense (assigns credit at every position), and applicable to ordinary web-scale text during pretraining. The model learns to think for itself before predicting what comes next, teaching independent thinking behavior earlier in training.
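The reward described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RLP implementation: the function name `information_gain_rewards` and the probability values are hypothetical, and in real training the two probabilities would come from model forward passes with and without the sampled thought in context.

```python
import math

def information_gain_rewards(p_with_thought, p_context_only):
    """Per-position reward: log p(x_t | context, thought) - log p(x_t | context).

    Both arguments are sequences holding the probability assigned to the
    observed next token at each position (hypothetical inputs for this sketch).
    """
    if len(p_with_thought) != len(p_context_only):
        raise ValueError("need one probability pair per position")
    return [
        math.log(pw) - math.log(pc)
        for pw, pc in zip(p_with_thought, p_context_only)
    ]

# Made-up probabilities of the observed token at three positions.
with_thought = [0.50, 0.20, 0.90]
context_only = [0.25, 0.20, 0.95]

rewards = information_gain_rewards(with_thought, context_only)
for position, reward in enumerate(rewards):
    print(f"position {position}: reward {reward:+.3f}")
```

A positive reward means the thought made the observed token more likely, zero means it made no difference, and a negative reward means it hurt the prediction. Every position receives a value, which is what makes the signal dense, and no task-specific checker is involved.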
Pretraining with RLP on Qwen3-1.7B lifts the average across eight math-and-science benchmarks by 19%, and the gains compound further under identical post-training. Applied to Nemotron-Nano-12B, RLP raises the overall average from 42.81% to 61.32%. The largest improvements are on reasoning-heavy tasks like AIME25 and MMLU-Pro.
This is significant because it reframes when reasoning should be learned. Building on the note "Do base models already contain hidden reasoning ability?", RLP suggests that pretraining itself can plant stronger reasoning seeds. And in light of the note "Does RL teach reasoning or just when to use it?", RLP may teach the "how" during pretraining, leaving post-training to teach the "when", which is a cleaner division of labor.
Unlike prior reinforcement pretraining (RPT), which uses sparse binary rewards and relies on proxy-model filtering, RLP provides a continuous improvement signal at every position and trains on full documents, eliminating the need to preselect high-entropy tokens.
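The contrast above can be made concrete with a toy sketch. Everything here is an assumption for illustration: the helper names, the entropy threshold, and the exact-match form of the binary reward are hypothetical stand-ins, not the actual RPT or RLP procedures.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def filtered_positions(proxy_distributions, threshold):
    """Filtering-style selection: keep positions a proxy model finds uncertain."""
    return [
        i for i, dist in enumerate(proxy_distributions)
        if entropy(dist) > threshold
    ]

def binary_reward(predicted, observed):
    """Sparse signal: 1 only when the predicted token matches the observed one."""
    return 1.0 if predicted == observed else 0.0

def dense_reward(p_with_thought, p_context_only):
    """Continuous signal: log-likelihood gain on the observed token."""
    return math.log(p_with_thought) - math.log(p_context_only)

# Hypothetical proxy-model distributions at three positions of one document.
proxy = [[0.97, 0.02, 0.01], [0.4, 0.3, 0.3], [0.5, 0.5, 0.0]]

selected = filtered_positions(proxy, threshold=0.6)  # filtered subset
all_positions = list(range(len(proxy)))              # full-document training

print("filtered:", selected, "full document:", all_positions)
print("binary:", binary_reward("cat", "dog"), "dense:", dense_reward(0.5, 0.25))
```

In this toy setup the filter drops the low-entropy first position, while full-document training keeps all three. The binary reward gives no credit to a near miss, whereas the dense reward still registers any increase in the likelihood of the observed token.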
Inquiring lines that use this note as a source (75)
This note is a source for these synthesized inquiries.
- What makes reasoning capability a pre-training rather than post-training phenomenon?
- Can prompting unlock compositional skills that pretraining already learned?
- Why do chain-of-thought prompts work if reasoning is not systematic?
- Does each reasoning step in chain-of-thought introduce cumulative error?
- How much does pre-training frequency predict reasoning task performance?
- Can extended thinking genuinely improve reasoning or just increase variance?
- Why does chain-of-thought fail when problems lack matching training schemata?
- How much does pretraining contribute to ToM performance versus task-specific training?
- Does task ordering affect multi-task reinforcement learning outcomes?
- Why does early experience provide better warm-starts for downstream reinforcement learning?
- Does constraining AI access during early task phases preserve skill formation?
- Can energy minimization replace reasoning-specific reinforcement learning for system 2 thinking?
- Can reinforcement learning add missing domain knowledge to fine-tuned reasoning models?
- Can RL teach when to use reasoning versus when to respond directly?
- How does chain-of-thought training change higher layer computations?
- How do probability-based rewards compare to self-consistency as training signals for reasoning?
- Why do models learn reasoning form instead of actual abstract inference?
- Can chain-of-thought reasoning be genuinely causal if exemplars don't need logic?
- Does reinforcement learning learn optimal per-turn reasoning discipline?
- Do chain-of-thought explanations reveal genuine reasoning or trigger latent features?
- Does policy entropy collapse limit how many iterations of reasoning training work?
- Why does imitation learning create a ceiling for reasoning capability?
- Why does inference-time thinking hurt proactive critical thinking in vanilla models?
- How does the pretrained prior set a capability ceiling for reward model exploration?
- How does reinforcement learning differ from chain-of-thought distillation?
- Can RL training teach models when to activate reasoning versus when to skip it?
- How do reasoning training methods sacrifice some thinking skills while improving others?
- Can random rewards improve reasoning models if pretraining is suitable?
- How does a single training example trigger phase transitions in reasoning output?
- Can extended RL training unlock genuinely new reasoning strategies models cannot discover otherwise?
- Can models learn both what and how to study through reinforcement learning?
- Why do SFT models memorize patterns instead of learning generalizable reasoning?
- How does chain-of-thought reasoning become decorative after domain-specific fine-tuning?
- Does reinforcement learning preserve reasoning quality better than supervised fine-tuning?
- Does RL training actually restore the critical thinking that reasoning models lose?
- Does inference-time compute improve pretraining data efficiency in practice?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- Does format-based pretraining determine how models respond to reinforcement learning?
- How does policy initialization with sub-policies enable emergent thinking?
- Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?
- Can training models on backward reasoning improve their forward planning ability?
- What is the distinction between teaching reasoning how versus when to activate?
- Can pretraining signals unlock latent reasoning that post-training merely activates?
- Why does eliminating proxy-model filtering improve reasoning emergence in pretraining?
- What distinguishes reasoning activation mechanisms across different training methods?
- How does backward reasoning during training improve forward reasoning capability?
- Why does self-segmentation into chunks-of-thought matter for reward models?
- Does reinforcement learning teach models how to reason or when to reason?
- Why might chain-of-thought reasoning bypass action selection pathways?
- Does next-token prediction actually explain how human thought works?
- How do timing and search internalization interact during reasoning post-training?
- Can the exploration ceiling be raised beyond what pretraining established?
- What makes token-level reasoning during pretraining different from test-time chain-of-thought?
- What training interventions could close the perception-action gap?
- What happens to representational structure during model pretraining phases?
- Does token-level reasoning during pretraining improve general reasoning without task-specific supervision?
- Why do knowledge and reasoning train in different network layers?
- How does action-level decomposition differ from token-level imitation in supervision?
- Does the token prediction framing actually capture what human reasoning does?
- Why does pre-training provide the raw material for emergent thinking?
- How do thought actions represent policy improvement steps in practice?
- What role does task structure play in rewarding delayed thinking?
- Can we predict when a model will develop thinking behaviors?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- Why does the pretrained prior determine the exploration ceiling?
- Can models learn to optimize their own chain-of-thought generation?
- What makes some bottlenecks invisible to chain-of-thought training?
- Can RL create new reasoning primitives that pretraining never established?
- When does reinforcement learning actually produce true reasoning gains in models?
- How does o1-style reasoning relate to learned search processes versus memorized solutions?
- Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?
- How does model scale affect anticipatory behavior in structured training?
- Can minimal training signals unlock reasoning already latent in pretrained representations?
- What makes content informative and not-yet-mastered for reinforcement during pretraining?
- Does targeting the edge of competence during RL pretraining unlock true reasoning gains?
Related concepts in this collection (4)
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  extends: RLP strengthens the latent reasoning during pretraining itself
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  complements: RLP may teach "how" during pretraining, leaving post-training for "when"
- Can models learn reasoning from predicting any text?
  Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
  parallels: both generate internal rationales at token level with self-supervised reward, but RLP operates during pretraining
- Can adversarial critics replace task-specific verifiers for reasoning?
  Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
  connects: both achieve verifier-free reasoning training but via different mechanisms
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RLP: Reinforcement as a Pretraining Objective
- Base Models Know How to Reason, Thinking Models Learn When
- Understanding and Mitigating Premature Confidence for Better LLM Reasoning
- Eliciting Reasoning in Language Models with Cognitive Tools
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- When More is Less: Understanding Chain-of-Thought Length in LLMs
Original note title
chain-of-thought as pretraining exploratory action with information-gain reward bridges next-token prediction and reasoning emergence