SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Should training maximize diversity when models feed into search?

If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?

Synthesis note · 2026-05-28 · sourced from Reinforcement Learning

The default post-training objective optimizes a single scalar reward, which pushes the policy toward a low-entropy distribution that concentrates probability on one mode. That is the right behavior if the model answers once and you take what it says. But increasingly the model is a component inside an inference-time search procedure — AlphaEvolve-style evolutionary search, best-of-k sampling, pass@k selection — that draws many rollouts and keeps the best. Here a model that always emits the same near-optimal answer is a liability: search has nothing to select among.
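The selection effect can be made concrete with a toy calculation. The sketch below computes the exact expected reward of the best of k independent samples from a discrete policy. Both policies and all of their numbers are hypothetical, chosen only to illustrate the argument: `sharp` always emits one good-but-not-best answer, while `diverse` spreads probability over worse, equal, and better answers.

```python
from itertools import accumulate


def expected_best_of_k(rewards, probs, k):
    """Exact expected max reward over k i.i.d. samples from a discrete policy."""
    # Sort outcomes by reward so the running sum of probabilities is the reward CDF.
    pairs = sorted(zip(rewards, probs))
    cdf = list(accumulate(p for _, p in pairs))
    total, prev = 0.0, 0.0
    for (reward, _), c in zip(pairs, cdf):
        # P(max == reward) = F(reward)^k - F(previous reward)^k
        total += reward * (c ** k - prev ** k)
        prev = c
    return total


# Hypothetical toy policies (illustrative numbers, not taken from any experiment).
sharp = ([0.8], [1.0])                          # one mode, always the same answer
diverse = ([0.2, 0.8, 1.0], [0.5, 0.3, 0.2])    # several modes of varying quality

for k in (1, 4, 16):
    print(k,
          round(expected_best_of_k(*sharp, k), 3),
          round(expected_best_of_k(*diverse, k), 3))
```

Under these assumed numbers the sharp policy wins when a single sample is drawn, and the diverse policy overtakes it once the selector has several samples to choose from. The sharp policy's value does not move with k at all, which is the sense in which search has nothing to select among.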

Vector Policy Optimization (VPO) makes the consequence explicit. What the deployment loop rewards is not the quality of any single response but the quality of the best response in a sampled set, and the gap between diversity-trained and scalar-trained policies widens as the search budget grows. For evolutionary search the effect is categorical: VPO-trained models solve problems that GRPO-trained models cannot solve at all, because GRPO's collapsed distribution never proposes the seed variation that search needs to mutate from.
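A simplified way to see why the effect can be categorical rather than gradual: if a collapsed policy assigns zero probability to any working approach on a hard problem, no sampling budget recovers it, whereas a small nonzero probability compounds quickly. The sketch below uses the standard "at least one success in k independent tries" formula. The success rates are assumed values for illustration, and independent sampling is a stand-in here; evolutionary search mutates earlier candidates rather than sampling i.i.d.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample success rates on one hard problem (assumed values).
collapsed_p = 0.0   # the policy never proposes a working approach
diverse_p = 0.02    # the policy occasionally proposes one

for k in (1, 16, 256):
    print(k, pass_at_k(collapsed_p, k), round(pass_at_k(diverse_p, k), 3))
```

With these numbers the collapsed policy stays at zero for every budget, while the diverse policy's success probability climbs toward one as k grows, so the gap between the two widens with the search budget.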

Why it matters: it inverts a tacit assumption. We tend to treat entropy reduction as evidence that training worked — the model "knows the answer." But if the model is a generator feeding a selector, sharpness is overfitting to the wrong objective. The post-training target should match the inference-time objective, and when inference is search, that objective is coverage of competent modes. The tension is real: optimizing for set-quality trades away single-shot pass@1, so the choice depends on whether deployment samples once or many times.

Original note title

when models run inside test-time search training should maximize diversity of competent solutions instead of converging on one best answer