INQUIRING LINE

Training, RL, and Test-Time Scaling · Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · cross-cluster

How does imitation pretraining followed by RL exploration compare to either method alone?

This explores whether warming up a model by imitating good examples first, then letting it explore via reinforcement learning, beats doing either step on its own — and why the order matters.

The corpus is unusually direct here: sequencing the two beats either alone, and the reason is mechanical. Running an imitation phase first and *then* RL substantially outperforms both methods in isolation, because the imitation phase produces reasonable attempts that make the later reward signal informative; without that foundation, outcome rewards have little to sharpen (see "Does sequencing imitation then exploration training improve reasoning?"). The same logic shows up in a controlled study of when RL actually extends reasoning: gains appear only when pretraining has already planted the reasoning primitives and RL targets tasks right at the edge of the model's competence. Absent those conditions, RL just refines which answers get sampled rather than teaching anything new (see "When does RL actually extend reasoning beyond pretraining?").
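The mechanical argument can be illustrated with a toy model. The sketch below is not taken from any of the cited notes: it assumes a softmax policy over 1000 discrete actions, a sparse reward of 1 on a single correct action, and an imperfect demonstrator that picks the correct action only 30% of the time. All names (`success_prob`, `imitation_step`, `rl_step`) and all constants are hypothetical. It also uses the exact expected policy gradient rather than sampled rollouts, which flatters RL-only training; with sampling, the cold-start signal would be noisier still.

```python
import math

NUM_ACTIONS = 1000   # assumed size of the action space
TARGET = 0           # the single rewarded action (hypothetical)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def imitation_step(logits, demo, lr):
    """One gradient step on cross-entropy against the demonstrator's distribution."""
    probs = softmax(logits)
    return [x + lr * (d - p) for x, d, p in zip(logits, demo, probs)]

def rl_step(logits, lr):
    """One step along the exact gradient of expected reward (reward 1 on TARGET).

    d p_target / d logit_i = p_target * (1[i == TARGET] - p_i), so the whole
    update is scaled by the current success probability p_target.
    """
    probs = softmax(logits)
    p_t = probs[TARGET]
    return [x + lr * p_t * ((1.0 if i == TARGET else 0.0) - p)
            for i, (x, p) in enumerate(zip(logits, probs))]

def success_prob(imitation_steps, rl_steps, lr=1.0):
    # Imperfect demonstrator: 30% on the correct action, 70% on nine near-misses.
    demo = [0.0] * NUM_ACTIONS
    demo[TARGET] = 0.3
    for i in range(1, 10):
        demo[i] = 0.7 / 9
    logits = [0.0] * NUM_ACTIONS  # uniform initial policy
    for _ in range(imitation_steps):
        logits = imitation_step(logits, demo, lr)
    for _ in range(rl_steps):
        logits = rl_step(logits, lr)
    return softmax(logits)[TARGET]

# Same total budget of 400 update steps in every condition.
results = {
    "imitation_only": success_prob(400, 0),
    "rl_only": success_prob(0, 400),
    "imitation_then_rl": success_prob(200, 200),
}
for name, p in results.items():
    print(f"{name}: success probability {p:.4f}")
```

In this toy setting, imitation alone is capped near the demonstrator's 30% accuracy, RL alone barely moves within the budget because its update is scaled by a success probability that starts at 1/1000, and the sequenced run starts RL from a policy that already succeeds often enough for the reward to pull it well above the demonstrator. Given a much larger budget the RL-only run would eventually climb too, so the sketch shows a difference in how informative the reward is at the start, not a hard impossibility.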

Sources 8 notes

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

When does RL actually extend reasoning beyond pretraining?

A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
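One simple way to make "entropy collapse" concrete is to track the Shannon entropy of the distribution over strategies an agent uses. The sketch below is an illustration only: the two strategy distributions are invented numbers, not measurements from the cited work, and `strategy_entropy` is a hypothetical helper name.

```python
import math

def strategy_entropy(probs):
    """Shannon entropy in nats; zero-probability strategies contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Invented example: an agent spreading evenly over four search strategies,
# versus one that has converged almost entirely on a single strategy.
diverse_policy = [0.25, 0.25, 0.25, 0.25]
collapsed_policy = [0.97, 0.01, 0.01, 0.01]

print(f"diverse:   {strategy_entropy(diverse_policy):.3f} nats")
print(f"collapsed: {strategy_entropy(collapsed_policy):.3f} nats")
```

A falling value of this quantity over training would be one signal of the narrowing described above; the choice of what counts as a distinct "strategy" is left open here and would need to be defined per task.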

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
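For readers unfamiliar with pass@k, the sketch below shows the commonly used combinatorial estimator: given n sampled attempts of which c are correct, it estimates the chance that at least one of k attempts drawn from them is correct. The note does not say which estimator the cited work used, so treat this as an assumption; the sample counts are invented for illustration.

```python
import math

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields an estimate of exactly 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Invented counts: a base model that solves a task in 1 of 100 samples,
# and a tuned model that solves it in 40 of 100.
for label, correct in [("base", 1), ("tuned", 40)]:
    scores = {k: pass_at_k(100, correct, k) for k in (1, 10, 100)}
    print(label, {k: round(v, 3) for k, v in scores.items()})
```

With these invented counts the two models differ sharply at k = 1 but both reach 1.0 at k = 100, which is the pattern expected when training only makes an already reachable answer more likely. A gain that persists at every k, as the note above reports, is what separates an expanded capability boundary from improved sampling efficiency.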

Why do RL agents exploit before exploring enough?

Task-oriented RL incentivizes premature exploitation of prior knowledge. Training exploration and execution as distinct objectives with separate verifiable rewards yields better downstream performance.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
