Does outcome-based RL diversity loss spread across unsolved problems?
When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?
Outcome-based RL (rewarding only final answer correctness) produces substantial accuracy gains but systematically reduces generation diversity. This is known. What is new: the diversity loss transfers across problems. Concentrating probability mass on correct answers for solved problems propagates to unsolved problems — the model's entire output distribution narrows, not just its distribution on problems it can solve.
The transfer mechanism: RL sharpens the policy globally, not per-problem. A single policy is shared across all problems, so when the model learns to concentrate on correct trajectories for problems it has solved, the same sharpening also shows up as reduced diversity on problems it has not solved. As a result, RL can reduce effective diversity relative to the base model even on the training set.
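To make the idea of global sharpening concrete, here is a toy sketch. It is not the note's or any paper's model: it assumes, purely for illustration, that RL acts like a single shared inverse-temperature (`sharpness`) applied to every problem, and the logits are made-up numbers. Under that assumption, sharpening that helps on a solved problem also lowers entropy on an unsolved one and pushes mass away from its currently unlikely correct answer.

```python
import math

def softmax(logits, sharpness):
    """Softmax with a shared inverse temperature.

    `sharpness` is a hypothetical stand-in for global RL sharpening.
    """
    scaled = [sharpness * z for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Made-up base-model preferences over four candidate final answers.
SOLVED_LOGITS = [2.0, 0.5, 0.0, -0.5]    # the top answer is the correct one
UNSOLVED_LOGITS = [1.0, 0.8, 0.6, 0.4]   # the correct answer is the least likely
UNSOLVED_CORRECT = 3

def report(sharpness):
    """Return (entropy on solved, entropy on unsolved, P(correct) on unsolved)."""
    solved = softmax(SOLVED_LOGITS, sharpness)
    unsolved = softmax(UNSOLVED_LOGITS, sharpness)
    return entropy(solved), entropy(unsolved), unsolved[UNSOLVED_CORRECT]

for sharpness in (1.0, 2.0, 4.0):
    h_solved, h_unsolved, p_correct = report(sharpness)
    print(f"sharpness={sharpness}: H(solved)={h_solved:.3f} "
          f"H(unsolved)={h_unsolved:.3f} P(correct | unsolved)={p_correct:.3f}")
```

In this toy, nothing is ever trained on the unsolved problem, yet its distribution narrows anyway because the sharpening parameter is shared.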
The practical consequence: diversity is critical for test-time scaling. As argued in "Why does parallel reasoning outperform single chain thinking?", diverse parallel samples are more valuable than many copies of similar reasoning. And as argued in "Why does majority voting outperform more complex inference methods?", voting requires genuine diversity to work: voting over near-identical samples provides no signal.
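The point about voting can be illustrated with a small sketch. The helper names (`majority_vote`, `distinct_fraction`) and the sample answers are hypothetical, chosen only for this example. When all samples agree, the vote returns exactly what a single sample would have returned, so the extra samples carry no additional information.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and its vote share."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

def distinct_fraction(answers):
    """Crude diversity measure: distinct final answers per sample."""
    return len(set(answers)) / len(answers)

# Made-up final answers from eight parallel samples on one problem.
collapsed = ["42"] * 8
diverse = ["42", "17", "17", "17", "42", "17", "9", "17"]

for name, samples in (("collapsed", collapsed), ("diverse", diverse)):
    winner, share = majority_vote(samples)
    print(f"{name}: winner={winner} share={share:.3f} "
          f"distinct={distinct_fraction(samples):.3f}")
```

For the collapsed batch the vote share is 1.0 regardless of whether the answer is right, which is why agreement alone stops being evidence of correctness once diversity is gone.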
The key conceptual contribution is distinguishing two forms of exploration:
- Historical exploration — visiting diverse states and actions during training. Improves pass@1 (single-attempt accuracy) because the model encounters more training signal. Does not guarantee test-time diversity.
- Batch exploration — producing diverse outputs at test time. Improves pass@k (k-attempt coverage) because outputs span more of the solution space. Does not improve training diversity.
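The difference between pass@1 and pass@k can be made concrete with the commonly used unbiased estimator, pass@k = 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The per-problem correct counts below are invented for illustration: one model is fully collapsed (always right or always wrong), the other spreads its mass.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N_SAMPLES = 16
# Hypothetical correct counts out of 16 samples on four problems.
collapsed_model = [16, 16, 0, 0]  # certain on two problems, no coverage on the rest
diverse_model = [8, 8, 4, 2]      # less certain, but some coverage everywhere

for name, counts in (("collapsed", collapsed_model), ("diverse", diverse_model)):
    p1 = mean_pass_at_k(counts, N_SAMPLES, 1)
    p8 = mean_pass_at_k(counts, N_SAMPLES, 8)
    print(f"{name}: pass@1={p1:.3f} pass@8={p8:.3f}")
```

With these made-up counts the collapsed model has the higher pass@1, while its pass@8 does not improve at all with more attempts; the diverse model trades some pass@1 for much higher pass@8.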
The two forms of exploration require different mechanisms. Historical exploration uses UCB-style bonuses over outcome space (tractable because reasoning tasks have a limited set of distinct final answers). Batch exploration uses within-batch repetition penalties. The distinction directly instantiates the split described in "Why do reasoning models fail differently at training versus inference?": historical and batch exploration map onto training time and test time, respectively, with concrete algorithmic prescriptions.
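The two mechanisms can be sketched side by side. The note gives no formulas, so everything specific here is an assumption made for illustration: the class name `OutcomeExplorer`, the bonus form `bonus_scale / sqrt(count + 1)`, the linear repetition penalty, and the default coefficients. The sketch only shows the structural difference: the bonus depends on counts accumulated across training history, while the penalty depends only on the current batch.

```python
import math
from collections import Counter, defaultdict

class OutcomeExplorer:
    """Toy reward shaping over final answers (hypothetical, not a paper's method)."""

    def __init__(self, bonus_scale=1.0, batch_penalty=0.5):
        self.bonus_scale = bonus_scale
        self.batch_penalty = batch_penalty
        # Historical counts: problem id -> final answer -> times seen in training.
        self.history = defaultdict(Counter)

    def historical_bonus(self, problem_id, answer):
        """UCB-style bonus that shrinks as an outcome is visited more often."""
        seen = self.history[problem_id][answer]
        return self.bonus_scale / math.sqrt(seen + 1)

    def shaped_rewards(self, problem_id, answers, correct_answer):
        """Correctness reward + historical bonus - within-batch repetition penalty."""
        batch_counts = Counter(answers)
        rewards = []
        for answer in answers:
            base = 1.0 if answer == correct_answer else 0.0
            bonus = self.historical_bonus(problem_id, answer)
            repeats = batch_counts[answer] - 1  # other copies in this batch
            penalty = self.batch_penalty * repeats / len(answers)
            rewards.append(base + bonus - penalty)
        # Update history only after the whole batch has been scored.
        self.history[problem_id].update(answers)
        return rewards

explorer = OutcomeExplorer()
batch = ["3", "3", "5", "7"]  # made-up final answers; "7" is correct
first = explorer.shaped_rewards("p1", batch, correct_answer="7")
second = explorer.shaped_rewards("p1", batch, correct_answer="7")
print(first)
print(second)
```

Keeping counts per final answer is only feasible because, as noted above, reasoning tasks have a limited set of distinct final answers; the same bookkeeping over full reasoning traces would not be tractable.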
Inquiring lines that use this note as a source 38
- Why does RLVR increase token entropy while decreasing answer diversity?
- Can population diversity in self-improvement prevent error avalanching failures?
- Why do evolutionary algorithms collapse to single solutions under selection pressure?
- Why does AI output show diversity without multiplying actual points of view?
- Why does island model genetic evolution maintain diversity better than single populations?
- How does forced exploration through diversity rewards differ from suppression-based negative reinforcement?
- How do you verify whether your context distribution satisfies covariate diversity?
- How does covariate diversity compare to the exploration assumptions of LinUCB?
- What conditions make training diversity better than individual expert quality?
- How does training data distribution determine what models can learn?
- Why does positive reinforcement degrade diversity at higher k values?
- Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?
- How does majority voting fail when reasoning samples lack genuine diversity?
- Can diversity-aware RL objectives prevent format convergence?
- What creates the irreducible trade-off between quality and diversity in training data?
- How does diversity loss in synthetic data mirror tail distribution disappearance?
- How does RL compress reasoning path diversity during training?
- How do RL subnetworks identified from different random seeds compare?
- How does KL penalty strength affect the degree of format collapse during RL?
- Is distribution selection during RL the same compression mechanism as entropy collapse?
- How does diversity collapse during iterative self-improvement cycles?
- Can shifting the accuracy metric itself eliminate the need for diversity post-processing?
- How can semantic diversity optimization work if exploration and exploitation were truly opposed?
- How does diversity collapse during iterative self-improvement affect solution quality?
- Why do rare cases in medicine and science require models that preserve tail distributions?
- Does critique training improve exploration diversity during model training or only test time?
- Can explicitly optimizing for semantic diversity during RL training improve both quality and variation?
- What happens to model grounding when preference optimization increases effective diversity?
- Can dynamic variance weighting replace fixed objective combination weights?
- How does absolute-advantage weighting concentrate training on boundary cases?
- Why does step-level expert alignment work when outcome-only RL fails?
- Should test-time search maximize diversity of competent solutions instead of converging on one strategy?
- How does probability mass concentration affect sampling diversity across model scales?
- Why does outcome-based RL specifically lose diversity during training?
- How much does diversity training cost in single-shot pass@1 performance?
- Which aggregation method best exploits diversity in generated solutions?
- Why does diversity in LLM outputs mask sampling from community priors?
- How does the Learning Law explain why all examples should contribute equally?
Related concepts in this collection 6
- Why do reasoning models fail differently at training versus inference?
Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
this paper adds taxonomic precision: historical (training) vs batch (test-time) exploration with distinct algorithms
- Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
outcome-based exploration provides UCB-style bonuses at the outcome level to address collapse
- Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
diversity is prerequisite for parallel scaling; RL-induced diversity loss degrades it
- Does reinforcement learning squeeze exploration diversity in search agents?
Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.
diversity transfer mechanism operates across domains
- Does self-consistency reliably reward correct answers during training?
Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
self-consistency reward creates a specific diversity collapse pathway: optimizing for agreement among samples directly reduces the output diversity that makes self-consistency useful as a signal, creating a self-undermining reward dynamic
- Why does RLVR training narrow a model's problem solving ability?
RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
diversity loss and capability boundary collapse are the same dynamic at different levels: diversity loss transfers from solved to unsolved problems (this note), while capability boundary collapse describes the resulting scope narrowing; both require exploration mechanisms to counteract
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Outcome-based Exploration for LLM Reasoning
- Vector Policy Optimization: Training for Diversity Improves Test-Time Search
- Jointly Reinforcing Diversity and Quality in Language Model Generations
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- The Invisible Leash: Why RLVR May Not Escape Its Origin
Original note title
outcome-based rl induces diversity loss that transfers from solved to unsolved problems — historical and batch exploration require separate mechanisms