Should training maximize diversity when models feed into search?
If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
The default post-training objective optimizes a single scalar reward, which pushes the policy toward a low-entropy distribution that concentrates probability on one mode. That is the right behavior if the model answers once and you take what it says. Increasingly, though, the model is a component inside an inference-time search procedure (AlphaEvolve-style evolutionary search, best-of-k sampling, or any setting scored by pass@k) that draws many rollouts and keeps the best. There, a model that always emits the same near-optimal answer is a liability: search has nothing to select among.
Vector Policy Optimization (VPO) makes the consequence explicit. What the deployment loop actually rewards is not the quality of a single response but the quality of the best response in a sampled set, and the gap between diversity-trained and scalar-trained policies widens as the search budget grows. For evolutionary search the effect is categorical: VPO-trained models solve problems that models trained with GRPO (Group Relative Policy Optimization) cannot solve at all, because GRPO's collapsed distribution never proposes the seed variation that search needs to mutate from.
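To make the set-quality argument concrete, here is a minimal toy sketch. It is not the VPO method and uses no numbers from the paper. It assumes each problem is solved by exactly one strategy, and that a policy is simply a probability distribution over strategies. The names `set_success_rate`, `problem_mix`, `sharp`, and `diverse` are all hypothetical.

```python
# Toy model: illustrative assumptions only, not results from any paper.
# Each problem is solvable by exactly one strategy ("A", "B", or "C").
# A policy is a probability distribution over which strategy a rollout uses.

def set_success_rate(policy, problem_mix, k):
    """Chance that at least one of k independent rollouts solves a random problem."""
    total = 0.0
    for strategy, weight in problem_mix.items():
        p_hit = policy.get(strategy, 0.0)  # chance one rollout uses the right strategy
        total += weight * (1.0 - (1.0 - p_hit) ** k)
    return total

# Hypothetical share of problems that each strategy solves.
problem_mix = {"A": 0.7, "B": 0.2, "C": 0.1}

# A collapsed policy that always plays the most common strategy,
# and a diverse policy that spreads mass over all competent strategies.
sharp = {"A": 1.0}
diverse = {"A": 0.5, "B": 0.3, "C": 0.2}

for k in (1, 4, 16, 64):
    s = set_success_rate(sharp, problem_mix, k)
    d = set_success_rate(diverse, problem_mix, k)
    print(f"k={k:>2}  sharp={s:.3f}  diverse={d:.3f}  gap={d - s:+.3f}")
```

Under these assumptions the collapsed policy wins at k = 1 but stays flat as k grows, because no amount of resampling reaches the strategies it never proposes. The diverse policy starts lower and overtakes it once the budget allows several rollouts.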
Why it matters: this inverts a tacit assumption. We tend to treat entropy reduction as evidence that training worked, that the model "knows the answer." But if the model is a generator feeding a selector, sharpness amounts to overfitting to the wrong objective. The post-training target should match the inference-time objective, and when inference is search, that objective is coverage of competent modes. The tension is real: optimizing for set quality trades away single-shot pass@1, so the choice depends on whether deployment samples once or many times.
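One way to check which side of this trade-off a deployment sits on is to score two checkpoints at the k the search loop actually uses. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), popularized by code-generation benchmarks. It is not specific to this note or to VPO. The per-problem correct counts are invented for illustration, and the names `pass_at_k_estimate` and `mean_pass_at_k` are hypothetical.

```python
from math import comb

def pass_at_k_estimate(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # Probability that a random size-k subset of the n samples contains no correct one.
    p_all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - p_all_wrong

def mean_pass_at_k(correct_counts, n, k):
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k_estimate(n, c, k) for c in correct_counts) / len(correct_counts)

# Invented data: correct samples out of n = 20 on five problems.
n = 20
sharp_counts = [20, 20, 20, 0, 0]     # always right or always wrong
diverse_counts = [10, 10, 10, 6, 4]   # sometimes right on every problem

for k in (1, 10):
    print(k,
          round(mean_pass_at_k(sharp_counts, n, k), 3),
          round(mean_pass_at_k(diverse_counts, n, k), 3))
```

With these invented counts, the sharp checkpoint leads at k = 1 and the diverse checkpoint leads at k = 10. That is the pattern the note describes: the better checkpoint depends on the sampling budget at deployment.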
Inquiring lines that use this note as a source 15
- What makes external diversity more effective than sequential revision steps?
- Why does island model genetic evolution maintain diversity better than single populations?
- How do you verify whether your context distribution satisfies covariate diversity?
- Why do parallel and sequential test-time search methods produce equivalent results under fixed budgets?
- What conditions make training diversity better than individual expert quality?
- How does mutual shaping through diverse training compare to population-level diversity effects?
- Why does positive reinforcement degrade diversity at higher k values?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- What creates the irreducible trade-off between quality and diversity in training data?
- How does diversity collapse during iterative self-improvement cycles?
- Does critique training improve exploration diversity during model training or only test time?
- Should test-time search maximize diversity of competent solutions instead of converging on one strategy?
- How much does diversity training cost in single-shot pass@1 performance?
- Which aggregation method best exploits diversity in generated solutions?
- How do complexity and diversity affect model performance differently?
Related concepts in this collection 5
- Can diversity optimization improve quality during language model training?
  Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
  converging evidence that diversity-as-objective need not cost quality during training; VPO extends the payoff to inference-time search
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  names the failure VPO routes around: scalar RL collapses entropy, starving downstream search of varied candidates
- Can evolutionary search beat sampling and revision at inference time?
  Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biologically inspired search improves planning without formal problem definitions.
  the deployment regime VPO trains for; evolutionary search is exactly where diversity-trained policies unlock otherwise-unsolvable problems
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  frames the same train/test mismatch from the entropy side; VPO is one resolution that aligns the training objective with test-time sampling
- Why does majority voting outperform more complex inference methods?
  Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
  counterpoint on which selector to pair with diverse generation; the value of trained diversity depends on the aggregation method at inference
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Vector Policy Optimization: Training for Diversity Improves Test-Time Search
- Outcome-based Exploration for LLM Reasoning
- Jointly Reinforcing Diversity and Quality in Language Model Generations
- Learning to Discover at Test Time
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- Retrieval-augmented reasoning with lean language models
- NoveltyBench: Evaluating Language Models for Humanlike Diversity
Original note title
when models run inside test-time search training should maximize diversity of competent solutions instead of converging on one best answer