SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Does diversity loss from outcome-based RL spread to unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Synthesis note · 2026-02-22 · sourced from Reward Models

Outcome-based RL (rewarding only final answer correctness) produces substantial accuracy gains but systematically reduces generation diversity. This is known. What is new: the diversity loss transfers across problems. Concentrating probability mass on correct answers for solved problems propagates to unsolved problems — the model's entire output distribution narrows, not just its distribution on problems it can solve.

The transfer mechanism: RL sharpens the policy globally, not per problem. When the model learns to concentrate on correct trajectories for problems it has solved, the same sharpening narrows its outputs on problems it has not solved. As a result, RL can reduce effective diversity even on the training set relative to the base model.
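A toy sketch can make the global-sharpening intuition concrete. It is not the mechanism measured in the source: it assumes, purely for illustration, that the policy on every problem is a softmax over fixed base logits multiplied by one shared `scale` parameter, and that RL only updates that shared parameter. The logits, learning rate, and step count are hypothetical.

```python
import math


def softmax(logits, scale):
    """Numerically stable softmax of scale * logits."""
    scaled = [scale * x for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means more diverse outputs."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical base-model logits over four candidate final answers.
SOLVED = [2.0, 0.5, 0.0, -0.5]   # answer 0 is correct and already favoured
UNSOLVED = [0.3, 0.2, 0.1, 0.0]  # never rewarded during training


def train_scale(scale, steps=50, lr=0.5):
    """Gradient ascent on expected outcome reward, using the solved problem only.

    Expected reward is pi(correct), and its derivative with respect to the
    shared scale is pi(correct) * (logit_correct - E_pi[logit]).
    """
    for _ in range(steps):
        probs = softmax(SOLVED, scale)
        mean_logit = sum(p * x for p, x in zip(probs, SOLVED))
        scale += lr * probs[0] * (SOLVED[0] - mean_logit)
    return scale


if __name__ == "__main__":
    before, after = 1.0, train_scale(1.0)
    print("solved   entropy:", entropy(softmax(SOLVED, before)), "->", entropy(softmax(SOLVED, after)))
    print("unsolved entropy:", entropy(softmax(UNSOLVED, before)), "->", entropy(softmax(UNSOLVED, after)))
```

In this toy, the reward only ever touches the solved problem, yet the entropy on the unsolved problem falls as well, because the only trainable parameter is shared. Real policies share far more than one scalar, so this shows the direction of the effect rather than its size.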

The practical consequence: diversity is critical for test-time scaling. As covered in "Why does parallel reasoning outperform single chain thinking?", diverse parallel samples are more valuable than many copies of similar reasoning. And as covered in "Why does majority voting outperform more complex inference methods?", voting requires genuine diversity to work: voting over near-identical samples adds no signal beyond a single sample.
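The point about voting can be checked with a small sketch. The answer strings and the two sample sets below are made up for illustration; `majority_vote`, `covers`, and `distinct_answers` are hypothetical helper names, not functions from any library.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def covers(answers, correct):
    """pass@k-style coverage: does any sample reach the correct answer?"""
    return correct in answers


def distinct_answers(answers):
    """Number of different final answers among the samples."""
    return len(set(answers))


# Eight samples from a collapsed policy: every sample repeats the same wrong answer.
collapsed = ["17"] * 8
# Eight samples from a more diverse policy; "42" is taken to be correct.
diverse = ["17", "42", "42", "9", "42", "17", "42", "13"]

if __name__ == "__main__":
    for name, samples in [("collapsed", collapsed), ("diverse", diverse)]:
        print(name, distinct_answers(samples), majority_vote(samples), covers(samples, "42"))
```

With the collapsed set, eight samples carry the same information as one, so neither voting nor coverage improves with more compute. With the diverse set, the vote has something to aggregate.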

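The key conceptual contribution is distinguishing two forms of exploration: historical exploration, which concerns diversity over the course of training, and batch exploration, which concerns diversity among the samples drawn together for a problem, the property that matters at test time.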

These require different mechanisms. Historical exploration uses UCB-style bonuses over outcome space (tractable because reasoning tasks have a limited set of distinct final answers). Batch exploration uses within-batch repetition penalties. The distinction directly instantiates "Why do reasoning models fail differently at training versus inference?": historical and batch exploration map onto training time and test time respectively, each with a concrete algorithmic prescription.
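The two mechanisms could be sketched as follows. The note does not give the exact formulas, so the functional forms here are assumptions: a count-based bonus of `coef / sqrt(n + 1)` stands in for "UCB-style", and the penalty is the fraction of other samples in the batch that share the same final answer. The names `HistoricalBonus` and `batch_penalties` are hypothetical.

```python
import math
from collections import Counter


class HistoricalBonus:
    """Count-based bonus over (problem, final answer) outcomes seen during training."""

    def __init__(self, coef=1.0):
        self.coef = coef
        self.counts = {}

    def bonus(self, problem_id, answer):
        # Assumed UCB-style form: outcomes seen less often earn a larger bonus.
        n = self.counts.get((problem_id, answer), 0)
        return self.coef / math.sqrt(n + 1)

    def update(self, problem_id, answers):
        # Record every sampled final answer for this problem.
        for answer in answers:
            key = (problem_id, answer)
            self.counts[key] = self.counts.get(key, 0) + 1


def batch_penalties(answers, coef=1.0):
    """Penalise each sample by the share of other batch samples with the same answer."""
    n = len(answers)
    if n <= 1:
        return [0.0] * n
    counts = Counter(answers)
    return [coef * (counts[a] - 1) / (n - 1) for a in answers]


if __name__ == "__main__":
    hist = HistoricalBonus()
    print(hist.bonus("p1", "42"))   # unseen outcome, largest bonus
    hist.update("p1", ["42", "42", "42"])
    print(hist.bonus("p1", "42"))   # bonus shrinks after repeated visits
    print(batch_penalties(["a", "a", "b"]))
```

Counting over final answers rather than full reasoning traces is what keeps the historical table small, which matches the tractability argument above. The batch penalty needs no memory at all: it depends only on the samples drawn together.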

Original note title

outcome-based rl induces diversity loss that transfers from solved to unsolved problems — historical and batch exploration require separate mechanisms