SYNTHESIS NOTE

Should training maximize diversity when models feed into search?

If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?

Synthesis note · 2026-05-28 · sourced from Reinforcement Learning

The default post-training objective optimizes a single scalar reward, which pushes the policy toward a low-entropy distribution that concentrates probability on one mode. That is the right behavior if the model answers once and you take what it says. But increasingly the model is a component inside an inference-time search procedure — AlphaEvolve-style evolutionary search, best-of-k sampling, pass@k selection — that draws many rollouts and keeps the best. Here a model that always emits the same near-optimal answer is a liability: search has nothing to select among.
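A minimal sketch of the selection problem described above, assuming a toy task in which candidates are integers and a hypothetical verifier scores closeness to a hidden target. The names `best_of_k`, `sharp_policy`, `diverse_policy`, and `TARGET` are illustrative and not taken from any library or from the source work. The sketch only shows that best-of-k cannot improve on a policy that always emits the same answer.

```python
import random

def best_of_k(sample, score, k, rng):
    """Draw k rollouts from `sample` and keep the highest-scoring one."""
    return max((sample(rng) for _ in range(k)), key=score)

# Hypothetical toy task (an assumption for illustration): the verifier
# prefers integers close to a hidden target.
TARGET = 7

def score(x):
    return -abs(x - TARGET)

def sharp_policy(rng):
    # Collapsed distribution: always the same near-optimal answer.
    return 6

def diverse_policy(rng):
    # Probability spread over several competent candidates, target included.
    return rng.choice([3, 5, 6, 7, 9])

rng = random.Random(0)
for k in (1, 4, 16):
    sharp = score(best_of_k(sharp_policy, score, k, rng))
    diverse = score(best_of_k(diverse_policy, score, k, rng))
    print(f"k={k:>2}  sharp best score={sharp}  diverse best score={diverse}")
```

The sharp policy's best score stays at -1 for every budget, because all k rollouts are identical. The diverse policy can be worse at k=1 but has a chance of reaching the target that grows with k.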

Vector Policy Optimization (VPO) makes the consequence explicit. What the deployment loop actually rewards is not the quality of any single response but the quality of the best response in a sampled set, and the gap between diversity-trained and scalar-trained policies widens as the search budget grows. For evolutionary search the effect is categorical: VPO-trained models solve problems that GRPO-trained models cannot solve at all, because GRPO's collapsed distribution never proposes the seed variation that search needs to mutate from.

Why it matters: it inverts a tacit assumption. We tend to treat entropy reduction as evidence that training worked, that the model "knows the answer." But if the model is a generator feeding a selector, sharpness is overfitting to the wrong objective. The post-training target should match the inference-time objective, and when inference is search, that objective is coverage of competent modes. The tension is real: optimizing for set quality trades away single-shot pass@1 accuracy, so the choice depends on whether deployment samples once or many times.
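The trade-off can be made concrete with the standard identity for independent samples: if one rollout solves a problem with probability p, at least one of k rollouts solves it with probability 1 - (1 - p)^k. The sketch below applies that identity to two hypothetical policies. The per-problem probabilities are invented for illustration and are not results from the source work, and the selector is assumed to be a perfect verifier.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k

def mean_pass_at_k(success_probs, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in success_probs) / len(success_probs)

# Hypothetical per-problem success probabilities, not measured results.
sharp = [1.0] * 6 + [0.0] * 4   # certain on 6 problems, never solves the other 4
diverse = [0.3] * 10            # modest chance on every problem

for k in (1, 2, 4, 16, 64):
    s = mean_pass_at_k(sharp, k)
    d = mean_pass_at_k(diverse, k)
    print(f"k={k:>2}  sharp={s:.3f}  diverse={d:.3f}  gap={d - s:+.3f}")
```

Under these assumed numbers the sharp policy wins at k=1 (0.6 against 0.3) and stays at 0.6 for every budget, while the diverse policy overtakes it from k=3 onward and keeps gaining as k grows. That is the shape of the choice: sample once and sharpness pays, sample many times and coverage pays.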

Inquiring lines that read this note 22

This note is a source for the following research framings, listed by the broader line of inquiry each explores.

When does optimizing for quality undermine the value of diversity?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

What distinguishes training-time entropy collapse from test-time variance inflation?

How should inference compute be adaptively allocated based on prompt difficulty?

Should test-time search maximize diversity of competent solutions instead of converging on one strategy?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Do harness improvements transfer across model scales or memorize shortcuts?

What feedback signals matter most during harness evolution search?

When do additional thinking tokens stop improving reasoning performance?

What happens when models overthink during test-time search?

Related concepts in this collection 5

Can diversity optimization improve quality during language model training?
Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
converging evidence that diversity-as-objective need not cost quality during training; VPO extends the payoff to inference-time search

Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
names the failure VPO routes around: scalar RL collapses entropy, starving downstream search of varied candidates

Can evolutionary search beat sampling and revision at inference time?
Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biologically inspired search improves planning without formal problem definitions.
the deployment regime VPO trains for; evolutionary search is exactly where diversity-trained policies unlock otherwise-unsolvable problems

Why do reasoning models fail differently at training versus inference?
Reasoning models exhibit two distinct failure modes, entropy collapse during training and variance inflation during inference, that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
frames the same train/test mismatch from the entropy side; VPO is one resolution that aligns the training objective with test-time sampling

Why does majority voting outperform more complex inference methods?
Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
counterpoint on which selector to pair with diverse generation; the value of trained diversity depends on the aggregation method at inference
