How does imitation pretraining followed by RL exploration compare to either method alone?
This explores whether warming up a model by imitating good examples first, then letting it explore via reinforcement learning, beats doing either step on its own — and why the order matters.
The corpus is unusually direct here: sequencing the two does beat either alone, and the reason is mechanical. Running an imitation phase first and *then* RL substantially outperforms both methods in isolation, because the imitation phase creates reasonable attempts that make the later reward signal informative; without that foundation, outcome rewards have little to sharpen (see "Does sequencing imitation then exploration training improve reasoning?"). The same logic shows up in a controlled study of when RL actually extends reasoning: gains appear only when pretraining has already planted the reasoning primitives and RL targets tasks right at the edge of the model's competence. Absent those conditions, RL just refines which answers get sampled rather than teaching anything new (see "When does RL actually extend reasoning beyond pretraining?").
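The cold-start problem described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the cited notes: a softmax policy over 1,000 candidate answers with exactly one correct answer, a 0/1 outcome reward, exact (expected) policy-gradient steps instead of sampled rollouts, and an imitation budget capped at three steps as a stand-in for scarce demonstrations. With unlimited demonstrations this toy would be solved by imitation alone, so it shows only why an outcome reward carries almost no signal from a cold start, not the full empirical claim.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rl_step(logits, correct, lr):
    """One exact policy-gradient ascent step on expected reward J = p[correct].

    Reward is 1 for the correct answer and 0 otherwise, so
    dJ/dz_i = p_correct * (1[i == correct] - p_i).
    The whole gradient is scaled by p_correct: a policy that almost never
    produces the right answer gets almost no learning signal.
    """
    p = softmax(logits)
    pc = p[correct]
    return [z + lr * pc * ((1.0 if i == correct else 0.0) - p[i])
            for i, z in enumerate(logits)]

def imitation_step(logits, demo, lr):
    """One gradient ascent step on the log-likelihood of a demonstrated answer.

    d log p_demo / dz_i = 1[i == demo] - p_i, which does not vanish
    when the policy starts out far from the demonstration.
    """
    p = softmax(logits)
    return [z + lr * ((1.0 if i == demo else 0.0) - p[i])
            for i, z in enumerate(logits)]

def train(n_answers=1000, correct=0, imitation_steps=0, rl_steps=0, lr=1.0):
    """Return the final probability of the correct answer (hypothetical setup)."""
    logits = [0.0] * n_answers  # uniform cold start
    for _ in range(imitation_steps):
        logits = imitation_step(logits, correct, lr)
    for _ in range(rl_steps):
        logits = rl_step(logits, correct, lr)
    return softmax(logits)[correct]

rl_only = train(rl_steps=200)
imitation_only = train(imitation_steps=3)
imitation_then_rl = train(imitation_steps=3, rl_steps=200)

print(f"RL only:           {rl_only:.4f}")
print(f"Imitation only:    {imitation_only:.4f}")
print(f"Imitation then RL: {imitation_then_rl:.4f}")
```

In this sketch the RL-only run barely moves off its uniform start, the short imitation run reaches only a small success probability, and the sequenced run climbs much higher with the same RL budget, because the reward gradient is proportional to how often the policy already succeeds.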
Sources (8 notes)
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.
Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning: policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RL-trained models outperform base models across all pass@k levels when training uses KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Task-oriented RL incentivizes premature exploitation of prior knowledge. Training exploration and execution as distinct objectives with separate verifiable rewards yields better downstream performance.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.