Can reward vectors be the hidden source of solution diversity?
Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?
Diversity objectives in RL often feel arbitrary: you bolt on an entropy bonus or a novelty term and hope it spreads the policy without wrecking quality. Vector Policy Optimization (VPO) observes that the diversity axis is frequently already present in the reward structure and simply gets thrown away. Rewards are vector-valued in practice: per-test-case correctness in code generation, per-criterion ratings in RLHF, per-sub-question success in multi-hop reasoning, or scores from multiple user personas or reward models. Standard pipelines scalarize this vector into one number before computing the advantage, discarding the component structure.
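To make the information loss concrete, here is a minimal sketch in plain Python. The rollout names, the pass/fail vectors, and the group-mean baseline are all illustrative assumptions, not details taken from the VPO paper. Three hypothetical solutions pass different test cases, yet mean-scalarization gives them identical rewards and therefore identical advantages.

```python
def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Hypothetical per-test-case pass/fail vectors for three sampled solutions.
rollouts = {
    "solution_a": [1, 1, 0, 0],
    "solution_b": [0, 0, 1, 1],
    "solution_c": [1, 0, 1, 0],
}

# Standard pipeline: collapse each vector to one scalar, then subtract a
# group-mean baseline (an assumed, simple choice of advantage estimator).
scalar_rewards = {name: mean(vec) for name, vec in rollouts.items()}
baseline = mean(scalar_rewards.values())
scalar_advantages = {name: r - baseline for name, r in scalar_rewards.items()}
# Every scalar reward is 0.5, so every scalar advantage is 0.0.

# Keeping the vector: one baseline per component preserves the differences.
num_components = 4
component_baselines = [
    mean(vec[j] for vec in rollouts.values()) for j in range(num_components)
]
vector_advantages = {
    name: [vec[j] - component_baselines[j] for j in range(num_components)]
    for name, vec in rollouts.items()
}

print(scalar_advantages)
print(vector_advantages)
```

After scalarization the three solutions are indistinguishable to the learner, while the per-component advantages still record which tests each one solves.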
The pattern: keep the vector, and use its components as the dimensions along which solutions specialize. Rather than collapsing onto a single Pareto point, VPO combines multi-answer generation with stochastic reward scalarizations, training the model to emit a set of candidates that span the Pareto frontier — one solution that nails edge-case tests, another that optimizes the common path, another that trades correctness for brevity. The diversity is grounded in real trade-offs the task already encodes rather than imposed by an external regularizer, which is why it produces competent diversity rather than noise.
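The effect of stochastic scalarization can be sketched in a few lines of Python. This illustrates the general idea only, not VPO's actual training procedure: the note does not say how VPO samples its weights, so the Dirichlet-style sampling, the candidate names, and the reward values below are all assumptions. Each random weight vector picks a different "best" candidate, so the set of winners covers several trade-offs instead of one.

```python
import random

def sample_weights(num_components, rng, concentration=1.0):
    """Draw weights on the simplex by normalizing gamma draws (a Dirichlet sample)."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_components)]
    total = sum(draws)
    return [d / total for d in draws]

def scalarize(reward_vector, weights):
    """Weighted sum of reward components."""
    return sum(w * r for w, r in zip(weights, reward_vector))

def winners_under_random_scalarizations(candidates, num_draws, seed=0):
    """Names of candidates that are best under at least one sampled weighting."""
    rng = random.Random(seed)
    num_components = len(next(iter(candidates.values())))
    winners = set()
    for _ in range(num_draws):
        weights = sample_weights(num_components, rng)
        best = max(candidates, key=lambda name: scalarize(candidates[name], weights))
        winners.add(best)
    return winners

# Hypothetical reward vectors: (common-path tests, edge-case tests, brevity).
candidates = {
    "common_path_specialist": [1.0, 0.2, 0.3],
    "edge_case_specialist": [0.2, 1.0, 0.3],
    "brevity_specialist": [0.3, 0.3, 1.0],
    "dominated": [0.1, 0.1, 0.1],
}

# A single fixed scalarization (equal weights) selects exactly one candidate.
equal = [1 / 3, 1 / 3, 1 / 3]
fixed_winner = max(candidates, key=lambda name: scalarize(candidates[name], equal))

# Random scalarizations select several, but never the dominated candidate.
winners = winners_under_random_scalarizations(candidates, num_draws=200)
print(fixed_winner, sorted(winners))
```

The dominated candidate can never win because its weighted sum is lower under every positive weighting, which mirrors the note's claim that this kind of diversity stays competent instead of turning into noise.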
Why it matters: it reframes the question "where does diversity come from?" The answer is that the multi-objective structure of the reward is the diversity structure, latent until you stop scalarizing. This connects diversity-for-search to the broader multi-objective RL problem: the same vector reward whose components one method (DVAO) balances for stability is the one VPO spreads solutions across to cover the frontier. The counterpoint is that not every task has a meaningful reward vector: single-answer verifiable tasks with one binary reward offer no natural axis to specialize along.
Inquiring lines that use this note as a source (15)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why do evolutionary algorithms collapse to single solutions under selection pressure?
- How does forced exploration through diversity rewards differ from suppression-based negative reinforcement?
- Can lower embedding dimensions alone solve the diversity problem without attention mechanisms?
- Can structural diversity through role assignment replace emergent diversity in small models?
- Why does positive reinforcement degrade diversity at higher k values?
- How does diversity collapse during iterative self-improvement cycles?
- How does diversity collapse during iterative self-improvement affect solution quality?
- How do gradients flowing through both branches simultaneously reshape each component's role?
- How does directional diversity compare to other forms of parallel planning?
- Can vector-valued rewards preserve specialization better than variance-weighted advantages?
- Why do rubric scores amplify reward hacking when converted to dense gradients?
- How does DVAO balance reward components differently than VPO spreads them?
- Why does outcome-based RL specifically lose diversity during training?
- When does a task lack a meaningful multi-dimensional reward structure?
- Does semantic diversity in output space compete with reward-component diversity?
Related concepts in this collection (4)
- How should multiple reward objectives be weighted during training?
  When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
  the dual move on the same vector reward: DVAO balances components for stability while VPO spreads solutions across components for coverage
- Can diversity optimization improve quality during language model training?
  Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
  alternative diversity source (semantic, in output space) versus VPO's reward-component source; both refute the diversity-costs-quality assumption
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  another mechanism for sustaining diverse competent candidates, via critique rather than reward decomposition
- Does outcome-based RL diversity loss spread across unsolved problems?
  When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?
  diagnoses the diversity-loss failure that vector rewards are one structural antidote to
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Vector Policy Optimization: Training for Diversity Improves Test-Time Search
- Jointly Reinforcing Diversity and Quality in Language Model Generations
- DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
- Reinforcement Learning with Rubric Anchors
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- The Art of Scaling Reinforcement Learning Compute for LLMs
Original note title
vector-valued rewards give a natural diversity axis by letting solutions specialize along different reward dimensions