How does DVAO balance reward components differently than VPO spreads them?
This explores two opposite answers to the same problem — what to do when training has multiple reward signals at once — where DVAO collapses them into one balanced number and VPO refuses to collapse them at all.
Both methods start from the same problem, a model being trained against several reward objectives simultaneously, but they pull in opposite directions: DVAO *balances* the rewards into a single combined signal, while VPO deliberately keeps them *spread apart*. Seeing them side by side is the interesting part: they are not competing implementations of one idea but two philosophies about whether multiple objectives should ever be merged. DVAO's move is to weight each objective by its empirical variance within each group of rollouts, automatically turning up objectives that carry strong signal and damping the noisy ones, all without hand-tuned scalarization constants (see: How should multiple reward objectives be weighted during training?). The goal is a clean, bounded advantage number the policy can chase.
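To make the variance-weighting idea concrete, here is a minimal sketch of one plausible reading of it. The exact DVAO formula is not given in these notes, so everything specific here is an assumption: the helper name `variance_weighted_advantages`, the choice to make weights proportional to within-group variance, and the choice to standardize each objective before combining.

```python
from statistics import fmean, pvariance


def variance_weighted_advantages(group_rewards, eps=1e-8):
    """Combine K objectives into one advantage per rollout (hypothetical helper).

    group_rewards: rollouts from ONE group; each rollout is a list of
    K per-objective rewards. Returns (weights, advantages).
    """
    n_objectives = len(group_rewards[0])
    # Regroup the rewards so each column holds one objective across the group.
    columns = [[rollout[k] for rollout in group_rewards] for k in range(n_objectives)]
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]

    total = sum(variances)
    if total <= eps:
        # No objective separates the rollouts, so there is nothing to learn from.
        return [1.0 / n_objectives] * n_objectives, [0.0] * len(group_rewards)

    # Assumption: weight is proportional to within-group variance, so an
    # objective that does not distinguish rollouts in this group gets weight 0.
    weights = [v / total for v in variances]

    # Standardize each objective, then take the weighted sum.
    advantages = [
        sum(
            w * (rollout[k] - means[k]) / (variances[k] ** 0.5 + eps)
            for k, w in enumerate(weights)
        )
        for rollout in group_rewards
    ]
    return weights, advantages


# Objective 0 separates the rollouts; objective 1 is constant in this group.
group = [[1.0, 0.5], [0.0, 0.5], [1.0, 0.5], [0.0, 0.5]]
weights, advantages = variance_weighted_advantages(group)
print(weights)
print(advantages)
```

Under these assumptions the result is bounded by construction: each standardized term has magnitude at most the square root of (group size minus one), and the weights sum to 1, so the combined advantage obeys the same bound. No scalarization constant appears anywhere; the weights are recomputed from each group's data.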
VPO starts from the suspicion that merging is exactly what destroys something valuable. By keeping rewards decomposed per test case, criterion, or persona, and never scalarizing them, it treats the spread between objectives as a built-in diversity axis, training solutions to span the Pareto frontier of real trade-offs rather than converge on one blended optimum (see: Can reward vectors be the hidden source of solution diversity?). So the same multi-objective setup that DVAO treats as a mix of signal and noise to be weighed into one number, VPO sees as structure to be preserved. DVAO asks 'which signal do I trust most right now?'; VPO asks 'how do I keep all these signals visibly in tension?'
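A small sketch can show what "span the Pareto frontier" means in contrast to picking one blended optimum. This is not VPO's training objective, which these notes do not specify; it only illustrates the underlying concept. The function names and the example reward vectors are hypothetical.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(reward_vectors):
    """Indices of the vectors that no other vector dominates."""
    return [
        i
        for i, a in enumerate(reward_vectors)
        if not any(dominates(b, a) for j, b in enumerate(reward_vectors) if j != i)
    ]


def scalarized_best(reward_vectors, weights):
    """Index of the single winner once the vectors are collapsed to a weighted sum."""
    scores = [sum(w * x for w, x in zip(weights, v)) for v in reward_vectors]
    return max(range(len(scores)), key=scores.__getitem__)


# Four candidate solutions scored on two criteria (illustrative values).
candidates = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.4, 0.4)]

front = pareto_front(candidates)
winner = scalarized_best(candidates, (0.5, 0.5))
print(front)   # three distinct trade-offs survive
print(winner)  # scalarization keeps only one of them
```

With equal weights the scalarized view keeps only the compromise solution, while the vector view retains both specialists alongside it. Only the candidate that is worse on every criterion drops out of the frontier.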
The tension between them shows up elsewhere in the corpus, which suggests this is a recurring fault line rather than a one-off disagreement. There's evidence that scalar collapse genuinely throws information away: agent feedback decomposes into an evaluative part (how good the action was) and a directive part (how it should change), and a single scalar can capture the first but not the second (see: Can scalar rewards capture all the information in agent feedback?). The same structural argument appears for human preference: aggregating disagreeing users into one reward model isn't a quality bug, it's a representational impossibility, because a 51-49 split can't be honored by one number (see: Can aggregate reward models satisfy genuinely disagreeing users?). Those notes are effectively VPO's home-field advantage: when the objectives are genuinely irreconcilable, balancing them is the wrong verb.
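The 51-49 argument can be checked with a few lines of arithmetic. The sketch below assumes a deliberately simple model that the notes do not spell out: two user groups with opposite preferences between options A and B, a single policy that outputs A with some fixed probability, and a user who counts as unhappy whenever the output is not their preferred option. The function name is hypothetical.

```python
def unhappy_rates(share_a, prob_choose_a):
    """Unhappiness under one shared policy (hypothetical toy model).

    share_a: fraction of users who prefer option A (the rest prefer B).
    prob_choose_a: probability that the single policy outputs A.
    Returns (group A unhappy rate, group B unhappy rate, population average).
    """
    share_b = 1.0 - share_a
    a_unhappy = 1.0 - prob_choose_a  # group A is unhappy whenever B is output
    b_unhappy = prob_choose_a        # group B is unhappy whenever A is output
    overall = share_a * a_unhappy + share_b * b_unhappy
    return a_unhappy, b_unhappy, overall


# Always side with the 51% majority.
majority = unhappy_rates(0.51, 1.0)
# Flip a fair coin between the two options.
coin_flip = unhappy_rates(0.51, 0.5)
print(majority)
print(coin_flip)
```

Siding with the majority leaves the minority group unhappy every time, while the coin flip leaves every user unhappy half the time. No value of `prob_choose_a` drives both group rates to zero, which is the representational limit the note describes.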
But DVAO has its own backing, and the choice isn't strictly binary. There's a separate strategy of not converting every signal into a dense reward at all: using rubrics as accept/reject gates rather than as scores, which preserves their categorical strength while letting other rewards optimize underneath (see: Can rubrics and dense rewards work together without hacking?). That's a third position: some signals should be merged, some should gate, some should stay vectorized. And the sober reminder underneath all of it, which favors DVAO's focus on a well-scaled advantage, is that advantage normalization and a few plumbing choices often matter more than the algorithm's name, with the pretrained prior tending to set the ceiling regardless (see: Can two simple techniques match complex RL algorithms?).
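The gate-then-optimize pattern, combined with plain advantage normalization, can be sketched as follows. The notes say only that rubrics accept or reject rollout groups; the specific acceptance rule used here (every rollout in the group must pass) is an assumption, as are the function names and the input layout.

```python
from statistics import fmean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Standard group-relative advantage: subtract the mean, divide by the std."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def gated_advantages(groups, eps=1e-8):
    """Apply a rubric gate, then normalize dense rewards in the surviving groups.

    groups: list of (rubric_pass, dense_rewards) pairs, one pair per rollout
    group, where rubric_pass is a list of booleans and dense_rewards a list
    of floats of the same length.
    """
    kept = []
    for rubric_pass, dense_rewards in groups:
        # Assumed rule: a group is rejected if any rollout fails the rubric.
        # The rubric only gates; it never contributes a score to the reward.
        if not all(rubric_pass):
            continue
        kept.append(group_normalize(dense_rewards, eps))
    return kept


groups = [
    ([True, True], [1.0, 3.0]),    # passes the gate, dense rewards are used
    ([True, False], [10.0, 0.0]),  # rejected, however large the dense reward
]
result = gated_advantages(groups)
print(result)
```

The second group is dropped even though it contains the largest dense reward, which is the point of gating: a high score cannot buy its way past a failed rubric.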
The thing worth walking away with: 'balance' and 'spread' aren't two flavors of the same optimizer — they encode a bet about whether your objectives are noisy versions of one true reward (balance them, DVAO-style) or genuinely conflicting goods that a good model should hold in tension (spread them, VPO-style). The right answer depends entirely on which of those your task actually is.
Sources (6 notes)
DVAO weights objectives by their within-group variance, automatically up-weighting high-signal objectives and suppressing noise without hyperparameter tuning. This keeps advantage magnitudes bounded and replaces fixed scalarization constants with data-driven weighting.
Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between always leaving 49% unhappy and leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Advantage normalization and token-level loss aggregation allow critic-free PPO to surpass more complex algorithms. Systematic evaluation shows most RL techniques are setup-sensitive; the pretrained prior, not algorithm choice, sets the performance ceiling.