How does policy entropy during training affect search discipline during inference?
This explores whether the entropy a policy keeps (or loses) during RL training shapes how disciplined, or how narrow, a model's search behavior is at inference time: does training-time collapse make inference-time search narrow and brittle? The corpus suggests the link is direct. The exploration breadth you have at deployment is largely the breadth you protected during training, not something the model recovers on its own when given more compute.
The anchor is the finding that policy entropy collapse is the main bottleneck in RL scaling for reasoning (see 'Does policy entropy collapse limit reasoning performance in RL?'). There's a clean empirical law, R = -a·exp(H) + b, under which performance saturates as entropy approaches zero, because the policy converges onto a few reward-maximizing trajectories and stops trying alternatives. Interventions like Clip-Cov and KL-Cov exist precisely to slow that collapse and keep exploratory capacity alive. The striking part is that this same mechanism shows up in search agents specifically: RL training squeezes exploration diversity in search just as it does in reasoning, with policies narrowing onto a single confident path, while SFT on diverse demonstrations preserves the breadth (see 'Does reinforcement learning squeeze exploration diversity in search agents?'). So 'search discipline' at inference isn't only a decoding-time setting; it's inherited from how much entropy survived training.
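To make the saturation concrete, here is a minimal sketch of the fitted form listed in the source notes, R = -a·exp(H) + b. The coefficients `A` and `B` below are made-up values chosen only to show the shape of the curve, and the function names are hypothetical; nothing here is fitted to real training runs.

```python
import math

def predicted_performance(entropy: float, a: float, b: float) -> float:
    """Evaluate R = -a * exp(H) + b for a policy entropy H (in nats)."""
    return -a * math.exp(entropy) + b

def performance_ceiling(a: float, b: float) -> float:
    """Value of R at H = 0, where exp(H) = 1."""
    return b - a

# Hypothetical coefficients (assumption, for illustration only).
A, B = 0.2, 0.9

for h in (1.0, 0.5, 0.1, 0.0):
    print(f"H={h:.1f}  R={predicted_performance(h, A, B):.3f}")
print(f"ceiling at H=0: {performance_ceiling(A, B):.3f}")
```

Under this form with a > 0, the slope is -a·exp(H), so each further unit of entropy spent buys less performance than the last, and nothing is left to trade once H reaches zero.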
Where the entropy lives matters as much as how much there is. Only about 20% of tokens are high-entropy 'forking points,' and RLVR does most of its useful work by adjusting exactly those pivotal decision tokens (see 'Do high-entropy tokens drive reasoning model improvements?'). Training only on that minority matches or exceeds full-gradient updates. Read alongside the entropy-collapse law, this reframes the problem: collapse isn't uniform. It's the flattening of these specific branch points, and once they flatten, the model stops exploring the alternatives that branch points exist to choose between. The two-phase view sharpens this further: RL training first consolidates execution (which stabilizes its entropy) and only later opens up strategic planning, where planning becomes the new bottleneck and planning-token entropy actually *rises* (see 'Does RL training follow a predictable two-phase learning sequence?'). Healthy search discipline, then, looks like low entropy on the mechanical steps and preserved entropy on the strategic forks, not entropy minimized everywhere.
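One way to picture the forking-token idea is to compute a per-position entropy from the model's next-token logits and keep only the highest-entropy fraction of positions when averaging a loss. The sketch below is an assumption-laden illustration: the top-fraction selection rule, the 0.2 default, and all function names are hypothetical, and the notes do not specify how the original work picks its threshold.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def forking_mask(entropies: Sequence[float], top_fraction: float = 0.2) -> List[bool]:
    """Mark the top `top_fraction` of positions by entropy (at least one)."""
    k = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

def masked_mean_loss(per_token_loss: Sequence[float], mask: Sequence[bool]) -> float:
    """Average the loss over the selected positions only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept)

# Toy sequence: five positions over a three-token vocabulary (made-up logits).
toy_logits = [[8.0, 0.0, 0.0], [0.1, 0.0, -0.1], [6.0, 1.0, 0.0],
              [5.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
entropies = [token_entropy(row) for row in toy_logits]
mask = forking_mask(entropies)
print(mask)
```

In this toy sequence only the second position, whose logits are nearly flat, is selected; the confident positions are treated as mechanical steps and left out of the average.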
The payoff for the inference side is the finding that training regime beats inference compute budget: a model trained to keep productive exploration outperforms one that is simply given more tokens at deployment, whatever the budget, because the extra tokens are only useful if the policy knows how to spend them exploring (see 'Can non-reasoning models catch up with more compute?'). A collapsed policy handed more inference compute just repeats its narrow path more confidently. This connects to the deeper claim that base models already contain latent exploratory reasoning that post-training selects rather than creates (see 'Do base models already contain hidden reasoning ability?'), meaning entropy collapse during RL can actively *prune away* search behaviors the model was capable of before training narrowed it.
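A toy calculation can illustrate why extra samples do little for a collapsed policy. The sketch assumes, purely for illustration, that each search attempt is an independent draw from a fixed categorical distribution over ten candidate paths; the two distributions and the sample counts are invented, and real search agents are not this simple. The expected number of distinct paths seen after n draws is the sum over paths of 1 - (1 - p)^n.

```python
from typing import Sequence

def expected_distinct(probs: Sequence[float], n_samples: int) -> float:
    """Expected number of distinct outcomes after n independent draws."""
    return sum(1.0 - (1.0 - p) ** n_samples for p in probs)

# Hypothetical policies over ten candidate search paths (assumption).
collapsed = [0.991] + [0.001] * 9   # nearly all mass on one path
broad = [0.1] * 10                  # mass spread evenly

few, many = 4, 64
print(f"broad policy, {few} samples:      {expected_distinct(broad, few):.2f}")
print(f"collapsed policy, {many} samples: {expected_distinct(collapsed, many):.2f}")
```

In this toy setup the broad policy is expected to cover more distinct paths with 4 samples than the collapsed one does with 64, which is the sampling-side intuition behind the claim that budget cannot substitute for preserved entropy.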
The thing worth taking away: 'search discipline' is a double-edged phrase. You want the model disciplined enough to stop flailing on routine steps, but entropy collapse can over-discipline it — quietly deleting the branch-point diversity that makes inference-time search worth running at all. The corpus frames good training not as minimizing entropy but as managing *where* it collapses.
Sources (6 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.