How does forced exploration through diversity rewards differ from suppression-based negative reinforcement?
This explores two opposite-feeling ways to keep an RL-trained model from collapsing into one narrow strategy — actively rewarding variety versus just punishing wrong answers — and asks how they actually differ in mechanism and effect.
RL training tends to flatten a model's behavioral range, and there are two ways to fight that: *add* a reward for being diverse, or *only subtract* reward from wrong answers. Both end up preserving more diversity, but they work through very different machinery. The starting problem is well documented: outcome-based RL that rewards only final-answer correctness sharpens the policy globally, piling probability mass onto correct trajectories, and that diversity loss even bleeds from solved problems onto unsolved ones (see: Does outcome-based RL diversity loss spread across unsolved problems?). The same entropy-collapse mechanism shows up in search agents, where RL squeezes exploration breadth just as it does in reasoning (see: Does reinforcement learning squeeze exploration diversity in search agents?).
Suppression-based negative reinforcement is the lighter-touch answer. Training on only negative samples, which pushes down incorrect trajectories without ever rewarding the correct ones, consistently improves Pass@k and often matches full PPO or GRPO (see: Does negative reinforcement alone outperform full reinforcement learning?). The key is what it *doesn't* do: by never concentrating probability mass on a single winning answer, it leaves the model's existing spread of correct solutions intact. Positive reinforcement is what degrades Pass@k at higher k; remove it and diversity survives as a byproduct. So negative reinforcement doesn't *create* diversity; it *avoids destroying* the diversity already in the base model.
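A toy illustration of that asymmetry, offered as a minimal sketch rather than the cited method: it assumes a four-answer softmax policy (answers 0 to 2 correct, answer 3 wrong) and a single-sample REINFORCE update. The names `softmax` and `reinforce_step`, the starting logits, and the learning rate are all hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, sampled, reward, lr=1.0):
    """One REINFORCE update for a single sampled answer.

    For a softmax policy, d log p[sampled] / d logits = onehot(sampled) - p,
    so each logit moves by lr * reward * (onehot - p).
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == sampled else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# Assumed toy setup: answers 0, 1, 2 are correct; answer 3 is wrong.
logits = [1.0, 0.5, 0.0, 1.0]
before = softmax(logits)

# Positive-only step: a correct answer (0) was sampled and rewarded.
after_pos = softmax(reinforce_step(logits, sampled=0, reward=+1.0))

# Negative-only step: the wrong answer (3) was sampled and penalized.
after_neg = softmax(reinforce_step(logits, sampled=3, reward=-1.0))

for name, probs in [("before", before), ("positive", after_pos), ("negative", after_neg)]:
    print(name, [round(p, 3) for p in probs])
```

Under these assumptions, rewarding the sampled correct answer takes probability away from the other two correct answers, while penalizing the wrong answer raises all three correct answers at once. Real training averages over many samples and uses a neural policy, so this only shows the direction of a single update.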
Forced exploration through diversity rewards is the opposite stance: it injects structure rather than withholding pressure. Vector-valued rewards keep the reward signal unscalarized (decomposed per test-case, criterion, or persona), so solutions are pushed to span a Pareto frontier of real task trade-offs instead of converging on one optimum (see: Can reward vectors be the hidden source of solution diversity?). This is 'competent diversity' grounded in the task, not noise added by a regularizer. Critique models do something parallel from inside the training loop, counteracting tail-narrowing so the policy doesn't prematurely converge across self-training iterations (see: Do critique models improve diversity during training itself?).
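To make the Pareto-frontier idea concrete, here is a minimal sketch assuming each candidate solution carries a two-component reward vector. The candidate names, the reward values, and the helpers `dominates` and `pareto_front` are hypothetical and are not taken from Vector Policy Optimization itself.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards):
    """Return the indices of reward vectors that no other vector dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(other, r) for j, other in enumerate(rewards) if j != i)
    ]


# Assumed reward vectors, e.g. (criterion A score, criterion B score).
names = ["fast", "balanced", "thorough", "weak"]
rewards = [(0.9, 0.3), (0.7, 0.7), (0.2, 0.95), (0.5, 0.2)]

# Scalarizing (here: summing) collapses the choice to a single winner.
scalar_best = max(range(len(rewards)), key=lambda i: sum(rewards[i]))

# Keeping the vector keeps every non-dominated trade-off.
front = pareto_front(rewards)

print("scalarized winner:", names[scalar_best])
print("pareto front:", [names[i] for i in front])
```

In this toy setup the summed reward selects only one candidate, while the non-dominated set keeps three distinct trade-offs and drops only the candidate that is worse on every criterion. That is the sense in which the diversity is grounded in the task rather than added by a regularizer.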
The deeper distinction is *where the diversity comes from*. Suppression is conservative: it protects the model's pretrained breadth by refusing to over-commit. Diversity rewards are generative: they manufacture spread along axes the designer chose. That's why the two failure modes differ. Suppression can only preserve diversity that already exists in the base model, while reward-shaped exploration can push the model toward regions a pure suppress-the-wrong scheme would never reach. Notably, the corpus suggests that exploration itself splits into structurally different mechanisms: training-time exploration via UCB-style bonuses and test-time diversity via repetition penalties are not interchangeable (see: Does outcome-based RL diversity loss spread across unsolved problems?).
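The contrast between the two exploration mechanisms can be sketched as follows. This is an illustrative assumption about their general shape, not the formulation used in the cited work: the count-based bonus `scale / sqrt(n + 1)`, the linear repetition penalty, and both function names are hypothetical.

```python
import math
from collections import Counter


def ucb_style_bonus(history_counts, answer, scale=1.0):
    """Training-time bonus that shrinks as an answer recurs across the training history."""
    return scale / math.sqrt(history_counts[answer] + 1)


def batch_repetition_penalty(batch, scale=1.0):
    """Within-batch penalty: each answer is penalized by how many duplicates it has."""
    counts = Counter(batch)
    return [-scale * (counts[a] - 1) for a in batch]


# Historical view: answer "A" has been seen 8 times in past training, "B" never.
history = Counter({"A": 8})
bonus_seen = ucb_style_bonus(history, "A")
bonus_new = ucb_style_bonus(history, "B")

# Batch view: only the samples drawn together for one problem matter.
penalties = batch_repetition_penalty(["A", "A", "B"])

print("bonus for seen answer:", bonus_seen)
print("bonus for new answer:", bonus_new)
print("batch penalties:", penalties)
```

The two signals depend on different state: the bonus needs counts accumulated over the whole training history, while the penalty looks only at the current batch. That is one way to read the claim that they are not interchangeable.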
What you didn't know you wanted to know: diversity preservation isn't one knob but a spectrum of *asymmetric* signal-handling, and the most efficient approaches lean asymmetric on purpose. Treating successful and failed episodes differently, with successes as concrete demonstrations and failures as abstracted lessons, beats processing them uniformly (see: Should successful and failed episodes be processed differently?). And whether training reduces diversity at all depends on domain: preference tuning reduces diversity in code (where convergence is correct) but increases it in creative writing (where distinctiveness is the reward) (see: Does preference tuning always reduce diversity the same way?). So the real question isn't 'reward variety or punish errors'. It's whether your task wants the model to converge or to spread, and which asymmetry gets you there cheapest.
Sources (7 notes)
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.