SYNTHESIS NOTE

Can reward vectors be the hidden source of solution diversity?

Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?

Synthesis note · 2026-05-28 · sourced from Reinforcement Learning

Diversity objectives in RL often feel arbitrary: you bolt on an entropy bonus or a novelty penalty and hope it spreads the policy without wrecking quality. Vector Policy Optimization (VPO) observes that the diversity axis is frequently already present in the reward structure and simply gets thrown away. Rewards are vector-valued in practice: per-test-case correctness in code generation, per-criterion ratings in RLHF, per-sub-question success in multi-hop reasoning, or scores from multiple user personas or reward models. Standard pipelines scalarize this vector into one number before computing the advantage, discarding the component structure.
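To make the loss of structure concrete, here is a minimal sketch of what scalarizing before the advantage computation can hide. The mean as the scalarizer, the group-mean baseline, and the two rollout names are illustrative assumptions, not details taken from the VPO work.

```python
# Hypothetical per-test-case rewards for two rollouts on the same prompt.
# Each vector is [test_1, test_2, test_3, test_4]; 1 means the test passed.
rollouts = {
    "passes_edge_cases": [0, 0, 1, 1],
    "passes_common_path": [1, 1, 0, 0],
}


def scalarize_mean(reward_vec):
    """Collapse a reward vector to one number (assumed scalarizer: the mean)."""
    return sum(reward_vec) / len(reward_vec)


# Scalarize first, as a standard pipeline would.
scalars = {name: scalarize_mean(vec) for name, vec in rollouts.items()}

# Assumed advantage: scalar reward minus the group-mean baseline.
baseline = sum(scalars.values()) / len(scalars)
advantages = {name: s - baseline for name, s in scalars.items()}

for name in rollouts:
    print(name, rollouts[name], scalars[name], advantages[name])
```

Both rollouts get the same scalar and the same advantage even though they pass disjoint sets of tests, so the training signal cannot tell the two specializations apart.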

The pattern: keep the vector, and use its components as the dimensions along which solutions specialize. Rather than collapsing onto a single Pareto point, VPO combines multi-answer generation with stochastic reward scalarizations, training the model to emit a set of candidates that span the Pareto frontier — one solution that nails edge-case tests, another that optimizes the common path, another that trades correctness for brevity. The diversity is grounded in real trade-offs the task already encodes rather than imposed by an external regularizer, which is why it produces competent diversity rather than noise.
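The idea of stochastic scalarization can be sketched as follows. This is an illustration under stated assumptions, not the VPO algorithm itself: the weights are drawn from a symmetric Dirichlet distribution, the candidate names and reward vectors are made up, and "winning" simply means having the highest weighted reward for a given draw.

```python
import random


def sample_simplex_weights(k, rng, alpha=1.0):
    """Draw k nonnegative weights summing to 1 (symmetric Dirichlet via gamma draws)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vec, weights):
    """Weighted sum of reward components."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def winners_under_random_weights(candidates, n_draws, seed=0):
    """Count how often each candidate has the best scalarized reward."""
    rng = random.Random(seed)
    k = len(next(iter(candidates.values())))
    wins = {name: 0 for name in candidates}
    for _ in range(n_draws):
        w = sample_simplex_weights(k, rng)
        best = max(candidates, key=lambda name: scalarize(candidates[name], w))
        wins[best] += 1
    return wins


# Hypothetical reward vectors: [common-path score, edge-case score, brevity score].
candidates = {
    "edge_case_specialist": [0.6, 1.0, 0.3],
    "common_path_specialist": [1.0, 0.4, 0.5],
    "brevity_specialist": [0.7, 0.5, 1.0],
    "dominated": [0.5, 0.3, 0.2],  # worse than edge_case_specialist on every component
}

# A single fixed scalarization rewards exactly one candidate.
fixed = [1 / 3, 1 / 3, 1 / 3]
fixed_winner = max(candidates, key=lambda n: scalarize(candidates[n], fixed))

# Random scalarizations spread the wins across the non-dominated candidates.
wins = winners_under_random_weights(candidates, n_draws=2000)
print("fixed-weight winner:", fixed_winner)
print("wins under random weights:", wins)
```

With the fixed uniform weights only the brevity specialist is ever preferred. Under random weights each of the three specialists is preferred for some draws, while the dominated candidate is never preferred because another candidate beats it on every component.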

Why it matters: it reframes the question "where does diversity come from?" The multi-objective structure of the reward is the diversity structure, latent until you stop scalarizing. This connects diversity-for-search to the broader multi-objective RL problem: one method (DVAO) balances the components of a vector reward for stability, while VPO spreads solutions across those same components to cover the frontier. The counterpoint is that not every task has a meaningful reward vector: single-answer verifiable tasks with one binary reward offer no natural axis to specialize along.

Inquiring lines that read this note 18

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does objective evolution guide discovery better than fixed planning?

Why do evolutionary algorithms collapse to single solutions under selection pressure?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

What structural factors drive popularity bias in recommendation systems?

Can lower embedding dimensions alone solve the diversity problem without attention mechanisms?

When does optimizing for quality undermine the value of diversity?

What determines success in training models on multiple tasks?

How do gradients flowing through both branches simultaneously reshape each component's role?

How do aggregate reward models systematically exclude minority user preferences?

Can vector-valued rewards preserve specialization better than variance-weighted advantages?

Can language model RL training avoid reward hacking and misalignment?

Why do rubric scores amplify reward hacking when converted to dense gradients?

What constrains reinforcement learning's ability to expand model reasoning?

How does DVAO balance reward components differently than VPO spreads them?

What properties determine whether reward signals teach genuine reasoning?

When does a task lack a meaningful multi-dimensional reward structure?

Related concepts in this collection 4

How should multiple reward objectives be weighted during training?
When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
the dual move on the same vector reward: DVAO balances components for stability while VPO spreads solutions across components for coverage

Can diversity optimization improve quality during language model training?
Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
alternative diversity source (semantic, in output space) versus VPO's reward-component source; both refute the diversity-costs-quality assumption

Do critique models improve diversity during training itself?
Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
another mechanism for sustaining diverse competent candidates, via critique rather than reward decomposition

Does outcome-based RL diversity loss spread across unsolved problems?
When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?
diagnoses the diversity-loss failure that vector rewards are one structural antidote to

Original note title

vector-valued rewards give a natural diversity axis by letting solutions specialize along different reward dimensions
