INQUIRING LINE

Why does outcome-based RL specifically lose diversity during training?

This explores why reinforcement learning that rewards only the final answer (not the reasoning steps) tends to narrow a model's range of outputs as training proceeds — and what the corpus says the underlying mechanism is.


Outcome-based RL is training that rewards only whether the final answer is correct, ignoring how the model got there, and it specifically erodes the variety in a model's outputs. The corpus points to a single root cause: when the only signal is "was the answer right," the optimizer has nothing to push toward except piling probability mass onto whatever trajectories already succeed. The policy sharpens, and that sharpening is global, not local. Does outcome-based RL diversity loss spread across unsolved problems? shows the striking part: the collapse doesn't stay confined to problems the model has solved. It transfers, reducing diversity even on unsolved problems where exploration is exactly what you'd want to preserve.
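The sharpening described above can be seen in a toy setting. The following is a minimal sketch, not the setup used in any of the cited notes: it assumes a softmax policy over six hypothetical candidate trajectories for a single problem, a binary outcome reward, and exact expected policy-gradient updates. The names `train`, `rewards`, and `initial_logits` are illustrative. It does not model the transfer to unsolved problems, only the within-problem concentration of probability mass.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def train(logits, rewards, lr=1.0, steps=200):
    """Exact expected policy-gradient ascent on J = sum_i pi_i * r_i."""
    logits = list(logits)
    history = []  # (entropy, expected reward) recorded before each update
    for _ in range(steps):
        probs = softmax(logits)
        expected_reward = sum(p * r for p, r in zip(probs, rewards))
        history.append((entropy(probs), expected_reward))
        # For a softmax policy, dJ/dlogit_j = pi_j * (r_j - J).
        logits = [x + lr * p * (r - expected_reward)
                  for x, p, r in zip(logits, probs, rewards)]
    return softmax(logits), history

# Hypothetical toy problem: six trajectories, the first two are correct.
rewards = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
# Assumption: trajectory 0 starts slightly more likely than the rest.
initial_logits = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0]

final_probs, history = train(initial_logits, rewards)

print("entropy before:", round(history[0][0], 3))
print("entropy after: ", round(entropy(final_probs), 3))
print("final policy:  ", [round(p, 3) for p in final_probs])
```

Because each logit's update is scaled by that trajectory's current probability, the correct trajectory that starts ahead pulls further ahead of the other correct one. In this sketch the reward never distinguishes between the two correct trajectories, yet the policy still commits unevenly between them.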

Mechanically, this is the same entropy collapse documented across very different tasks, which suggests it's a property of outcome reward rather than of any one domain. Does reinforcement learning squeeze exploration diversity in search agents? finds search agents converge on narrow reward-maximizing strategies through the identical mechanism seen in reasoning, while supervised fine-tuning on diverse demonstrations keeps exploration broad. Does RL training collapse format diversity in pretrained models? sharpens the picture further: within the first epoch, RL amplifies one output format inherited from pretraining and suppresses the alternatives — and which format wins depends on model scale, not even on which one performs best. So part of the diversity loss is the optimizer arbitrarily committing to one mode early and never looking back.

The corpus also reveals that the collapse is not uniform — it depends on what the reward actually incentivizes. Does preference tuning always reduce diversity the same way? shows RLHF reduces lexical-syntactic variety in code but *increases* it in creative writing, because code rewards convergence on a correct solution while creative writing rewards distinctiveness. Does training order reshape how models handle different task types? makes this concrete: structured domains drive output entropy down while open-ended ones drive it up, and training the structured tasks first prevents the entropy collapse from spilling over and damaging the creative capabilities. Diversity loss, in other words, is what outcome reward looks like in domains with a single correct target.

There's a darker version of the same dynamic worth knowing about. Do overly hard RLVR samples actually harm model capabilities? shows that when problems are nearly impossible, group-relative normalization treats rare accidental successes as high-advantage trajectories — so the model collapses onto answer-repetition and computation-skipping shortcuts that then contaminate skills it already had. And Does binary reward training hurt model calibration? notes that pure binary correctness rewards push the model toward confident guessing because nothing penalizes a confident wrong answer. Both are diversity collapse with a sign on it: the policy doesn't just narrow, it narrows onto the wrong thing.
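To make the group-relative point concrete, here is a small sketch under stated assumptions. It assumes the common mean-and-standard-deviation normalization over a group of sampled rollouts; the cited note does not spell out the exact formula, and `group_relative_advantages` and the group sizes are hypothetical. It compares the advantage given to a lone success on a nearly impossible problem with the advantage given to a success on a moderately hard one.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical groups of 16 rollouts with binary outcome rewards.
hard_group = [1.0] + [0.0] * 15       # one accidental success
medium_group = [1.0] * 8 + [0.0] * 8  # half the rollouts succeed
impossible_group = [0.0] * 16         # no rollout succeeds

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)
impossible_adv = group_relative_advantages(impossible_group)

print("advantage of the lone success:   ", round(hard_adv[0], 3))
print("advantage of a success at 8 / 16:", round(medium_adv[0], 3))
print("advantages when nothing succeeds:", set(impossible_adv))
```

With k successes out of n, the normalized advantage of a success works out to roughly sqrt((n - k) / k), so the rarer the success, the harder the update pushes toward whatever that rollout happened to do, shortcut or not.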

The interesting twist is that none of this is inevitable: it's a consequence of *what* you reward, not of RL itself. Can diversity optimization improve quality during language model training? shows that adding a semantic-diversity reward actually raises quality, because diversity catalyzes exploration rather than competing with quality. Do critique models improve diversity during training itself? shows that injecting step-level critique into the training loop keeps solution variety alive, counteracting the tail-narrowing. And Can reward vectors be the hidden source of solution diversity? dissolves the problem at its source: if you keep the reward as an unscalarized vector (per criterion, per test case, per persona), solutions naturally spread across a Pareto frontier instead of all chasing one scalar. The common thread is that outcome-based RL loses diversity precisely because it compresses a many-dimensional notion of "good" into a single scalar the optimizer can only maximize one way.
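The contrast between a scalar and a vector reward can be illustrated with a small selection example. This is only a sketch: the candidate names and scores are made up, and it shows which solutions survive under each view of the reward, not how Vector Policy Optimization actually trains a policy. Summing the per-criterion scores leaves one winner, while keeping the vector leaves every non-dominated solution on the frontier.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: vec
        for name, vec in candidates.items()
        if not any(dominates(other, vec)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }

# Hypothetical solutions scored on three criteria (e.g. three test cases).
candidates = {
    "A": (1.0, 0.2, 0.3),  # excels on criterion 1
    "B": (0.3, 1.0, 0.2),  # excels on criterion 2
    "C": (0.6, 0.6, 0.7),  # balanced
    "D": (0.2, 0.2, 0.2),  # dominated by C
    "E": (0.5, 0.5, 0.6),  # dominated by C
}

# Scalarized view: collapse each vector to its sum and keep the maximum.
scalar_best = max(candidates, key=lambda name: sum(candidates[name]))

# Vector view: keep the whole non-dominated set.
front = pareto_front(candidates)

print("scalar winner:", scalar_best)
print("Pareto front: ", sorted(front))
```

Here the scalar view keeps only C, while the vector view keeps A, B, and C, each representing a different trade-off among the criteria.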


Sources (10 notes)

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
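A quick numerical check of the calibration half of this claim, under an assumed form of the combined reward: correctness minus the squared gap between stated confidence and correctness. The function names and the grid search are illustrative, and the sketch does not reproduce the note's proof. It only checks that, for a model whose true chance of being right is p, expected reward peaks when the stated confidence equals p.

```python
def binary_reward(correct):
    """Outcome-only reward: ignores stated confidence entirely."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct, confidence):
    """Assumed combined reward: correctness minus the Brier penalty."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_reward(p_correct, confidence):
    """Expected combined reward when the answer is right with prob p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))

def best_confidence(p_correct, grid_size=101):
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))

for p in (0.2, 0.5, 0.9):
    print("true accuracy", p, "-> best stated confidence", best_confidence(p))

# A confident wrong answer costs nothing under the binary reward,
# but is penalized more than a hedged wrong answer once the Brier term is added.
print(binary_reward(False),
      brier_augmented_reward(False, 0.5),
      brier_augmented_reward(False, 0.99))
```

Under the binary reward alone, expected reward is just p regardless of stated confidence, so nothing pulls confidence toward the true accuracy.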

Can diversity optimization improve quality during language model training?

DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can reward vectors be the hidden source of solution diversity?

Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether outcome-based RL's diversity collapse—documented in a curated library spanning 2024–2026—remains a hard constraint or has been architecturally or algorithmically relaxed.

What a curated library found — and when (dated claims, not current truth):
• Outcome-based RL globally sharpens policy, transferring diversity loss even to unsolved problems where exploration is needed (~2025).
• Within the first epoch, RL amplifies one pretraining output format and suppresses alternatives; which format wins depends on scale, not performance (~2025).
• RLHF reduces lexical-syntactic variety in code but increases it in creative writing; diversity loss is domain-dependent, tied to whether the reward incentivizes convergence (~2025).
• Binary correctness rewards push models toward confident guessing; group-relative normalization on hard problems induces shortcut-taking that contaminates learned skills (~2025–2026).
• Vector-valued rewards (unscalarized, per-criterion) naturally spread solutions across a Pareto frontier rather than collapse onto one scalar maximum (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (Echo Chamber, 2025)
• arXiv:2509.06941 (Outcome-based Exploration, 2025)
• arXiv:2605.22817 (Vector Policy Optimization, 2026)
• arXiv:2411.16579 (Critique Models, 2024)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (post-o1/o3 reasoning chains), algorithmic innovations (mixture-of-experts reward heads, hierarchical RL, intrinsic-motivation layering), or orchestration (multi-agent debate, iterative refinement harnesses) have since RELAXED or OVERTURNED it. Separate the durable question (does compressing reward into a single outcome signal lose information?) from the perishable limitation. Cite what resolved it; flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that argues outcome-based RL either does NOT erode diversity, or that diversity loss is actually beneficial in specific regimes.
(3) Propose 2 research questions that assume the regime may have moved: e.g., "Can continuous-valued reward decomposition preserve diversity *within* outcome-based training?" or "Does diversity collapse in outcome-RL transfer across model scales and architectures post-scaling laws update?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
