SYNTHESIS NOTE

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Synthesis note · 2026-02-22 · sourced from Reward Models

Outcome-based RL (rewarding only final answer correctness) produces substantial accuracy gains but systematically reduces generation diversity. This is known. What is new: the diversity loss transfers across problems. Concentrating probability mass on correct answers for solved problems propagates to unsolved problems — the model's entire output distribution narrows, not just its distribution on problems it can solve.

The transfer mechanism: RL sharpens the policy globally, not per-problem. When the model learns to concentrate on correct trajectories for problems it has solved, the reduced diversity in its generative distribution also manifests as reduced diversity on problems it has not solved. This means RL can reduce effective diversity even on the training set relative to the base model.
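One way to make "diversity on an unsolved problem" measurable is to look at the empirical distribution of final answers across repeated samples. The sketch below is only an illustration, not the metric used in the source: the function names `answer_entropy` and `distinct_fraction` and the sample data are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution of final answers."""
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def distinct_fraction(answers):
    """Number of distinct final answers divided by the number of samples."""
    return len(set(answers)) / len(answers)


# Hypothetical samples on one unsolved problem: neither policy finds the
# correct answer, but the RL policy's wrong answers are far less varied.
base_samples = ["17", "23", "9", "17", "31", "5", "23", "12"]
rl_samples = ["17", "17", "17", "17", "17", "23", "17", "17"]

for name, samples in [("base", base_samples), ("rl", rl_samples)]:
    print(name, round(answer_entropy(samples), 3), distinct_fraction(samples))
```

If the transfer claim holds, both numbers should drop after RL even on problems where accuracy has not moved.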

The practical consequence: diversity is critical for test-time scaling. As the note "Why does parallel reasoning outperform single chain thinking?" argues, diverse parallel samples are more valuable than many copies of similar reasoning. And as the note "Why does majority voting outperform more complex inference methods?" argues, voting requires genuine diversity to work: voting over near-identical samples provides no signal.
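A toy illustration of the voting point, using hypothetical sample data: when every sample returns the same answer, the vote can only echo a single sample, whereas a diverse batch lets the plurality answer win even if each individual sample is unreliable.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


# Collapsed policy: eight samples, one (wrong) answer. Voting adds nothing.
collapsed = ["17"] * 8

# Diverse policy: only half the samples are correct ("42"), but the wrong
# answers are scattered, so the vote still recovers the correct one.
diverse = ["42", "17", "42", "9", "42", "23", "42", "17"]

print(majority_vote(collapsed))
print(majority_vote(diverse))
```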

The key conceptual contribution is distinguishing two forms of exploration:

Historical exploration — visiting diverse states and actions during training. Improves pass@1 (single-attempt accuracy) because the model encounters more training signal. Does not guarantee test-time diversity.
Batch exploration — producing diverse outputs at test time. Improves pass@k (k-attempt coverage) because outputs span more of the solution space. Does not improve training diversity.
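The difference between pass@1 and pass@k can be made concrete with the standard unbiased pass@k estimator from code-generation evaluation: given n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The note itself does not state an estimator, so this choice is an assumption, and the two "policies" below are made-up counts chosen to show how the metrics can disagree.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n=16 samples on two problems.
sharpened = [16, 0]  # always right on problem A, never right on problem B
diverse = [8, 4]     # sometimes right on both

for k in (1, 8):
    print(k, mean_pass_at_k(sharpened, 16, k), mean_pass_at_k(diverse, 16, k))
```

With these counts the sharpened policy has the higher pass@1, while the diverse policy has the higher pass@8.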

These require different mechanisms. Historical exploration uses UCB-style bonuses over outcome space (tractable because reasoning tasks have a limited set of distinct final answers). Batch exploration uses within-batch repetition penalties. The distinction directly instantiates the note "Why do reasoning models fail differently at training versus inference?": historical and batch exploration map onto training time and test time respectively, each with a concrete algorithmic prescription.
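The two mechanisms can be sketched as reward-shaping terms. The note does not give formulas, so everything specific here is an assumption: the count-based bonus `coef / sqrt(N + 1)`, the penalty proportional to the share of duplicate answers in the batch, the coefficients, and all names (`OutcomeCounts`, `batch_penalties`, `shaped_rewards`) are hypothetical choices for illustration.

```python
import math
from collections import Counter, defaultdict


class OutcomeCounts:
    """Counts of final answers seen per problem over the training history."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def bonus(self, problem_id, answer, coef=1.0):
        # UCB-style (assumed form): rarely seen outcomes earn a larger bonus.
        return coef / math.sqrt(self.counts[problem_id][answer] + 1)

    def update(self, problem_id, answers):
        self.counts[problem_id].update(answers)


def batch_penalties(answers, coef=0.5):
    """Per-sample penalty that grows with duplicates inside the same batch."""
    n = len(answers)
    if n < 2:
        return [0.0] * n
    batch = Counter(answers)
    return [coef * (batch[a] - 1) / (n - 1) for a in answers]


def shaped_rewards(problem_id, answers, correct, history, use_batch_penalty=False):
    """Outcome reward plus historical bonus, minus optional batch penalty."""
    penalties = batch_penalties(answers) if use_batch_penalty else [0.0] * len(answers)
    rewards = [
        float(a == correct) + history.bonus(problem_id, a) - p
        for a, p in zip(answers, penalties)
    ]
    history.update(problem_id, answers)  # counts persist across batches
    return rewards


history = OutcomeCounts()
print(shaped_rewards("p1", ["17", "17", "23"], "23", history))
print(shaped_rewards("p1", ["17", "17", "23"], "23", history, use_batch_penalty=True))
```

The design point is that the historical term depends on counts that persist across training steps, while the batch term depends only on the samples drawn together, which is why the first targets training-time exploration and the second targets diversity among simultaneous samples.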

Inquiring lines that read this note 43

This note is a source for the research framings listed below, grouped by the broader line of inquiry each explores.

What constrains reinforcement learning's ability to expand model reasoning?

Why does RLVR increase token entropy while decreasing answer diversity?

How can AI systems learn from failures without cascading errors?

Can population diversity in self-improvement prevent error avalanching failures?

How does objective evolution guide discovery better than fixed planning?

Why do evolutionary algorithms collapse to single solutions under selection pressure?

When does optimizing for quality undermine the value of diversity?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

What are the consequences of models training on synthetic data?

How does test-time aggregation affect reasoning correctness and reliability?

How does majority voting fail when reasoning samples lack genuine diversity?

Does reinforcement learning teach reasoning or just when to reason?

How does RL compress reasoning path diversity during training?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How should we design LLM systems to maintain alignment and control?

How does KL penalty strength affect the degree of format collapse during RL?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Is distribution selection during RL the same compression mechanism as entropy collapse?

How can identical external performance mask different internal representations?

Why do rare cases in medicine and science require models that preserve tail distributions?

Why does finetuning cause catastrophic forgetting of model capabilities?

Can dynamic variance weighting replace fixed objective combination weights?

How can process reward models supervise complex reasoning traces?

Why does step-level expert alignment work when outcome-only RL fails?

How should inference compute be adaptively allocated based on prompt difficulty?

Should test-time search maximize diversity of competent solutions instead of converging on one strategy?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Related concepts in this collection 6

Why do reasoning models fail differently at training versus inference? Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
this paper adds taxonomic precision: historical (training) vs batch (test-time) exploration with distinct algorithms

Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
outcome-based exploration provides UCB-style bonuses at the outcome level to address collapse

Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
diversity is prerequisite for parallel scaling; RL-induced diversity loss degrades it

Does reinforcement learning squeeze exploration diversity in search agents? Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.
diversity transfer mechanism operates across domains

Does self-consistency reliably reward correct answers during training? Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
self-consistency reward creates a specific diversity collapse pathway: optimizing for agreement among samples directly reduces the output diversity that makes self-consistency useful as a signal, creating a self-undermining reward dynamic

Why does RLVR training narrow a model's problem solving ability? RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
diversity loss and capability boundary collapse are the same dynamic at different levels: diversity loss transfers from solved to unsolved problems (this note), while capability boundary collapse describes the resulting scope narrowing; both require exploration mechanisms to counteract
