INQUIRING LINE

Why does scalarization of rewards fail for multi-objective GRPO training?

This explores why collapsing several reward objectives into a single weighted sum (scalarization) breaks down when training reasoning models with GRPO — and what the corpus offers as alternatives.


The short version from the corpus: a fixed scalar throws away information that the objectives don't share, and the constants you pick to combine them are both arbitrary and unstable.

The most direct answer is about signal strength. GRPO computes advantages from the *spread* of rewards within a group of rollouts, so an objective that is noisy on one batch and sharp on another gets the same fixed weight either way, and its noise leaks straight into the advantage. DVAO's response is to drop fixed constants entirely and weight each objective by its empirical within-group variance, which amplifies whichever signal is informative right now, suppresses the rest, and keeps advantage magnitudes bounded (see "How should multiple reward objectives be weighted during training?"). That reframes the failure: scalarization isn't just hard to tune, it tunes a constant where the right value is data-dependent.
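A minimal sketch of why the right weight is data-dependent, in plain Python and under loud assumptions: the helper names (`group_advantages`, `scalarize`, `variance_weights`) are hypothetical, the advantage is the common standardize-within-group form, and the weighting shown is only one simple reading of "weight each objective by its within-group variance", not DVAO's actual formula. The reward numbers are made up for illustration.

```python
from statistics import fmean, pstdev, pvariance

def group_advantages(rewards, eps=1e-8):
    """Standardize scalar rewards within one rollout group (GRPO-style)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scalarize(reward_rows, weights):
    """Collapse each rollout's per-objective rewards into one weighted sum."""
    return [sum(w * r for w, r in zip(weights, row)) for row in reward_rows]

def variance_weights(reward_rows, eps=1e-8):
    """ASSUMPTION: weights proportional to each objective's within-group variance."""
    variances = [pvariance(col) for col in zip(*reward_rows)]
    total = sum(variances)
    if total < eps:  # no objective separates the rollouts in this group
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]

# Two groups of four rollouts, two objectives each (illustrative numbers).
# In group_a only objective 0 separates the rollouts; in group_b only objective 1 does.
group_a = [[1.0, 0.5], [0.0, 0.5], [1.0, 0.5], [0.0, 0.5]]
group_b = [[1.0, 0.2], [1.0, 0.8], [1.0, 0.2], [1.0, 0.8]]

fixed = [0.5, 0.5]  # chosen once, identical for every group
for name, group in (("group_a", group_a), ("group_b", group_b)):
    adaptive = variance_weights(group)  # recomputed per group
    print(name, "fixed:", fixed, "adaptive:", adaptive)
    print("  advantages:", group_advantages(scalarize(group, adaptive)))
```

The fixed weights are the same for both groups, while the variance-derived weights flip from one group to the next, which is the sense in which a single constant cannot be right everywhere.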

There's a deeper reason a single number can't carry the load: rewards mix kinds of information, not just amounts. Agent feedback decomposes into an *evaluative* part (how good was this?) and a *directive* part (what should change?). A scalar captures the first and discards the second, which makes the two complementary rather than substitutable (see "Can scalar rewards capture all the information in agent feedback?"). Critique-GRPO makes the same point from the plateau side: models stuck under numerical rewards start solving problems again once given chain-of-thought critiques, because the scalar never told them *why* they failed (see "Can natural language feedback overcome numerical reward plateaus?"). Scalarizing across objectives compounds this: you're summing away exactly the structure each objective was meant to preserve.

The corpus also shows scalarization actively *rewards the wrong thing*. Binary correctness rewards push models toward confident wrong answers because nothing in the scalar penalizes overconfidence; the fix is to add the Brier score as a second term that provably co-optimizes accuracy and calibration without trade-off (see "Does binary reward training hurt model calibration?"). And when people *do* try to fold a qualitative objective into the dense reward sum, it gets hacked. DRO finds that using rubrics as *gates* that accept or reject whole rollout groups, rather than converting rubric scores into reward points, prevents the gaming while preserving the rubric's categorical strength (see "Can rubrics and dense rewards work together without hacking?").
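The calibration point can be written down in a few lines. This is a sketch, assuming the model states a confidence in [0, 1] alongside its answer and that the second term is a squared-error (Brier) penalty on that confidence; the function name `calibrated_reward` and this exact combination are illustrative assumptions, not taken from the cited note.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Binary correctness minus a Brier penalty on the stated confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    brier = (confidence - y) ** 2  # 0 when the confidence matches the outcome
    return y - brier

# A plain binary reward gives both wrong answers the same score (0).
confident_wrong = calibrated_reward(False, 0.9)
hesitant_wrong = calibrated_reward(False, 0.1)
confident_right = calibrated_reward(True, 0.9)
```

With the penalty in place, the confidently wrong answer scores lower than the hesitantly wrong one, so overconfidence on errors is no longer free.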

The throughline: scalarization fails not because the math is wrong but because it presumes objectives are commensurable and stationary, when in GRPO they're neither. The alternatives the corpus reaches for (variance-based weighting, second terms with provable joint guarantees, gating instead of summing, language critiques alongside numbers) all share a move away from 'one number to rule them all.' Worth knowing alongside this: even negative-only reinforcement can match full GRPO by suppressing wrong trajectories while preserving diversity, a reminder that *what* you optimize against often matters more than how cleverly you weight it (see "Does negative reinforcement alone outperform full reinforcement learning?").
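To make the negative-only idea concrete, here is a deliberately simplified sketch. It assigns each rollout a signed weight on its log-probability from a binary correctness label; the `rollout_weights` helper, the mode names, and the plus or minus 1 values are assumptions for illustration, and the sketch omits GRPO's within-group normalization.

```python
def rollout_weights(correct, mode="full"):
    """Signed weight per rollout; a policy-gradient loss would scale each log-prob by it."""
    if mode not in {"full", "negative_only", "positive_only"}:
        raise ValueError(f"unknown mode: {mode}")
    weights = []
    for is_correct in correct:
        if is_correct:
            # Correct rollouts are reinforced unless training on negatives only.
            weights.append(0.0 if mode == "negative_only" else 1.0)
        else:
            # Wrong rollouts are suppressed unless training on positives only.
            weights.append(0.0 if mode == "positive_only" else -1.0)
    return weights

labels = [True, False, False, True]
full = rollout_weights(labels)
negative_only = rollout_weights(labels, "negative_only")
```

In the negative-only mode correct rollouts contribute no gradient, so probability is only pushed away from wrong trajectories rather than concentrated on the sampled correct ones.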


Sources (6 notes)

How should multiple reward objectives be weighted during training?

DVAO weights objectives by their within-group variance, automatically up-weighting high-signal objectives and suppressing noise without hyperparameter tuning. This keeps advantage magnitudes bounded and replaces fixed scalarization constants with data-driven weighting.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about multi-objective RL in LLMs. The question: Why does scalarization of rewards fail for multi-objective GRPO training?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable:

• Fixed-weight scalar reward aggregation leaks noise into advantage signals in GRPO; variance-adaptive weighting (DVAO, ~2026) empirically suppresses noise by weighting each objective by its within-group variance instead.
• Scalar rewards discard directive information (how to improve) while preserving only evaluative content (how good); natural-language critiques restore directive signal when numerical-only rewards plateau (Critique-GRPO, ~2025).
• Binary correctness rewards on their own degrade calibration by rewarding confident guessing; adding a proper scoring rule (Brier) as a second term provably co-optimizes accuracy and calibration without trade-off (~2025).
• Rubric-as-gate (hard feasibility filter) outperforms rubric-as-reward-points (DRO, ~2025); summing qualitative objectives into dense rewards invites gaming.
• Negative-only reinforcement often matches full PPO and GRPO on Pass@k by suppressing wrong trajectories while preserving diversity; what you optimize *against* often outweighs weighting method (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.03106 Critique-GRPO (2025-06)
• arXiv:2605.25604 DVAO (2026-05)
• arXiv:2506.13351 Direct Reasoning Optimization (2025-06)
• arXiv:2506.01347 Negative Reinforcement (2025-06)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—variance-adaptive weighting, directive vs. evaluative decomposition, Brier-score co-optimization, rubric gating, negative-only RL—judge whether newer training harnesses, multi-agent scaffolding, model scaling (post-2026 frontier), or fresh evaluation protocols have since RELAXED or OVERTURNED each. Separate the durable open question (multi-objective commensurability in group-advantage methods) from perishable implementation details. What concretely resolved or re-opened each?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper claim scalarization *does* work at scale, or propose an orthogonal regime where fixed weights become unproblematic?
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., does end-to-end learned objective weighting (not variance-adaptive, but gradient-optimized across episodes) outperform variance-adaptive, or do newer critiques render numerical weighting obsolete?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
