SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can chain-of-thought reasoning be learned during pretraining itself?

Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

The dominant paradigm separates pretraining (next-token prediction) from reasoning (RL post-training with verifiable rewards). RLP, a reinforcement-learning pretraining objective, challenges this by bringing RL's core mechanism, exploration, into pretraining itself. The key idea: treat chain-of-thought as an exploratory action taken before predicting each next token, with the reward computed from the information gain that thought provides.

The reward signal is elegant: measure the increase in log-likelihood of the observed token when conditioning on both context and a sampled reasoning chain, compared to context alone. This is verifier-free (no task-specific checkers needed), dense (assigns credit at every position), and applicable to ordinary web-scale text during pretraining. The model learns to think for itself before predicting what comes next, teaching independent thinking behavior earlier in training.
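The reward described above can be written as r_t = log p(x_t | context, thought) - log p(x_t | context). What follows is a minimal sketch of that computation on toy logits, not the RLP implementation: the function names, the three-token vocabulary, and the logit values are all hypothetical, and the note does not say how the context-only predictor is obtained, so it is simply passed in as a second set of logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def info_gain_reward(logits_with_thought, logits_without_thought, target_id):
    """Log-likelihood gain of the observed token from conditioning on a thought.

    Positive when the sampled thought made the observed token more likely,
    negative when it made it less likely, zero when it changed nothing.
    """
    lp_with = log_softmax(logits_with_thought)[target_id]
    lp_without = log_softmax(logits_without_thought)[target_id]
    return lp_with - lp_without


# Toy example (hypothetical numbers): vocabulary of 3 tokens, observed token id 0.
context_only = [0.0, 0.0, 0.0]      # uniform prediction without a thought
helpful_thought = [2.0, 0.0, 0.0]   # thought shifts mass toward the observed token
misleading_thought = [0.0, 2.0, 0.0]  # thought shifts mass toward a wrong token

helpful_reward = info_gain_reward(helpful_thought, context_only, target_id=0)
misleading_reward = info_gain_reward(misleading_thought, context_only, target_id=0)
neutral_reward = info_gain_reward(context_only, context_only, target_id=0)
```

Because the reward needs only the observed next token and two likelihoods, it can be computed at every position of ordinary text, which is what makes it verifier-free and dense.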

The gains are substantial: pretraining with RLP on Qwen3-1.7B lifts the average across eight math-and-science benchmarks by 19%, and with identical post-training the gains compound further. Applied to Nemotron-Nano-12B, the overall average rises from 42.81% to 61.32%, a gain of 18.51 percentage points. The largest improvements are on reasoning-heavy tasks such as AIME25 and MMLU-Pro.

This is significant because it reframes when reasoning should be learned. In light of the related note "Do base models already contain hidden reasoning ability?", RLP suggests that pretraining itself can plant stronger reasoning seeds. And in light of "Does RL teach reasoning or just when to use it?", RLP may teach the "how" during pretraining, leaving post-training to teach the "when", which is a cleaner division of labor.

Unlike prior reinforcement pretraining (RPT), which uses sparse binary rewards and relies on proxy-model filtering, RLP provides continuous improvement signals at every position and trains on full documents, eliminating the need to preselect high-entropy tokens.
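The contrast above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the entropy threshold, the exact-match form of the binary reward, the function names, and all numbers are hypothetical and are not taken from either method's actual implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sparse_binary_rewards(proxy_probs, guesses, targets, entropy_threshold):
    """Assumed sparse scheme: score only positions a proxy model finds uncertain.

    Returns {position: reward}, where the reward is 1.0 for an exact match
    with the observed token and 0.0 otherwise. Filtered positions get nothing.
    """
    rewards = {}
    for t, (probs, guess, target) in enumerate(zip(proxy_probs, guesses, targets)):
        if entropy(probs) >= entropy_threshold:
            rewards[t] = 1.0 if guess == target else 0.0
    return rewards


def dense_gain_rewards(logp_with_thought, logp_without_thought):
    """Dense scheme: a continuous log-likelihood gain at every position."""
    return [a - b for a, b in zip(logp_with_thought, logp_without_thought)]


# Toy document of three positions (hypothetical numbers).
proxy_probs = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]
guesses = [0, 1, 2]
targets = [0, 0, 2]

sparse = sparse_binary_rewards(proxy_probs, guesses, targets, entropy_threshold=1.0)
dense = dense_gain_rewards([-0.1, -0.9, -0.7], [-0.2, -1.4, -0.6])
```

In this toy run the sparse scheme skips the first, low-entropy position and returns only 0-or-1 values for the other two, while the dense scheme yields a signed, graded value for all three positions with no filtering step.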

Original note title

chain-of-thought as pretraining exploratory action with information-gain reward bridges next-token prediction and reasoning emergence