What makes advantage shaping more stable than reward shaping for tool training?
This explores why shaping the *advantage* (the post-baseline signal that tells a model how much better an action was than its peers) tends to train tool use more stably than shaping the *reward* itself (injecting bonuses and penalties into the raw score before any advantage is computed), and what the corpus says about where reward shaping goes wrong.
This question is really about *where* in the training pipeline you intervene. Reward shaping edits the raw score before the algorithm computes how good an action was relative to alternatives; advantage shaping edits that relative signal directly. The corpus keeps circling one explanation for why the later intervention is steadier: raw rewards are unbounded and gameable, while advantages are already normalized against a model's own peer rollouts, so shaping them keeps magnitudes in a controlled range and makes the signal harder to exploit.
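To make the two intervention points concrete, here is a minimal sketch. It assumes GRPO-style group standardization as the baseline step, and the function names, the `bonus` and `weight` parameters, and the `clip` value are all hypothetical choices for illustration, not anything the cited notes prescribe.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Standardize scores within one group of peer rollouts (assumed baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reward_shaped(correct, tool_calls, bonus):
    """Reward shaping: the bonus enters the raw score BEFORE normalization."""
    return [c + bonus * t for c, t in zip(correct, tool_calls)]

def advantage_shaped(correct, tool_calls, weight, clip=0.5):
    """Advantage shaping: normalize correctness first, then add a clipped nudge.

    The clip (a hypothetical constant) bounds how far the tool-use term can
    move the correctness advantage, whatever the raw tool-call counts are.
    """
    base = group_advantages(correct)
    tool = group_advantages(tool_calls)
    return [a + max(-clip, min(clip, weight * t)) for a, t in zip(base, tool)]

# Toy group of four rollouts: correctness (0/1) and number of tool calls made.
correct = [1.0, 0.0, 1.0, 0.0]
tool_calls = [0.0, 3.0, 1.0, 2.0]

via_reward = group_advantages(reward_shaped(correct, tool_calls, bonus=1.0))
via_advantage = advantage_shaped(correct, tool_calls, weight=1.0)
print("reward-shaped advantages:   ", [round(a, 3) for a in via_reward])
print("advantage-shaped advantages:", [round(a, 3) for a in via_advantage])
```

In this toy group the per-call bonus lets an incorrect rollout that spams tool calls receive the largest advantage, while the clipped nudge in advantage space leaves every correct rollout above every incorrect one.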
The clearest mechanism comes from variance-aware weighting How should multiple reward objectives be weighted during training?, which weights each objective by its within-group variance specifically to keep advantage magnitudes bounded and to replace fixed scalarization constants (the hand-tuned reward weights) with data-driven ones. That's the crux: reward shaping forces you to pick scaling constants up front. A tool-call bonus that's too large drowns out the correctness signal, while one that's too small does nothing. Operating in advantage space sidesteps most of that tuning because everything is already measured relative to the group.
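The variance-weighting idea can be sketched as follows. This is an illustrative guess at the shape of such a scheme, not DVAO's actual formula: it assumes all objectives are scored on a comparable scale (for example 0 to 1), and the function names and the proportional-to-variance rule are assumptions.

```python
from statistics import mean, pvariance

def standardize(xs, eps=1e-8):
    """Zero-mean, unit-scale version of one objective within a group."""
    mu, var = mean(xs), pvariance(xs)
    return [(x - mu) / (var ** 0.5 + eps) for x in xs]

def variance_weighted_advantages(objectives):
    """Combine objectives with weights proportional to within-group variance.

    objectives: dict mapping objective name -> per-rollout scores for one group.
    Returns (combined advantages, weights). Weights sum to 1, so the combined
    value can never exceed the largest standardized score in magnitude.
    """
    n = len(next(iter(objectives.values())))
    variances = {k: pvariance(v) for k, v in objectives.items()}
    total = sum(variances.values())
    if total == 0:
        # No objective separates the rollouts: nothing to learn from this group.
        return [0.0] * n, {k: 0.0 for k in objectives}
    weights = {k: variances[k] / total for k in objectives}
    z = {k: standardize(v) for k, v in objectives.items()}
    combined = [sum(weights[k] * z[k][i] for k in objectives) for i in range(n)]
    return combined, weights

# Toy group: correctness varies across rollouts, formatting is always satisfied.
objs = {
    "correct": [1.0, 0.0, 1.0, 0.0],
    "format_ok": [1.0, 1.0, 1.0, 1.0],
}
combined, weights = variance_weighted_advantages(objs)
print(weights)
print([round(c, 3) for c in combined])
```

An objective that every rollout in the group satisfies carries no comparative signal, so it gets zero weight here without anyone choosing a constant for it.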
The deeper reason reward shaping is fragile is that rewards are the surface a model learns to hack. The rubric work Can rubrics and dense rewards work together without hacking? makes this concrete: when rubric scores are *converted into dense rewards*, models hack them; when the same rubric instead *gates* which rollouts are even allowed to count, hacking disappears. That's the same instinct as advantage shaping — keep the categorical 'is this trajectory valid' judgment separate from the continuous 'how much do I optimize within it' signal. For tool training, where a malformed call or a hallucinated API should be a hard rejection rather than a slightly-lower number, that separation matters enormously.
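A rough sketch of the gate-versus-score separation for tool calls follows. It simplifies the source's idea to the level of individual rollouts (the rubric work describes accepting or rejecting rollout groups), and the tool registry, the JSON call format, and the fixed `reject` value are all hypothetical.

```python
import json
from statistics import mean, pstdev

ALLOWED_TOOLS = {"search", "calculator"}  # hypothetical tool registry

def is_valid_call(raw):
    """Categorical gate: well-formed JSON object naming a known tool."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and call.get("tool") in ALLOWED_TOOLS
        and isinstance(call.get("args"), dict)
    )

def gated_advantages(calls, scores, reject=-2.0, eps=1e-8):
    """Standardize dense scores among valid rollouts only.

    Invalid rollouts receive a fixed rejection value (an assumed constant),
    so a high dense score can never rescue a malformed or hallucinated call.
    """
    valid = [is_valid_call(c) for c in calls]
    kept = [s for s, ok in zip(scores, valid) if ok]
    if not kept:
        return [reject] * len(calls)
    mu, sigma = mean(kept), pstdev(kept)
    return [
        (s - mu) / (sigma + eps) if ok else reject
        for s, ok in zip(scores, valid)
    ]

calls = [
    '{"tool": "search", "args": {"q": "weather"}}',
    '{"tool": "launch_rocket", "args": {}}',   # hallucinated tool
    'search(weather)',                          # not JSON at all
    '{"tool": "calculator", "args": {"expr": "1+1"}}',
    '{"tool": "search", "args": {"q": "news"}}',
]
dense_scores = [0.9, 1.0, 1.0, 0.3, 0.6]
advantages = gated_advantages(calls, dense_scores)
print([round(a, 3) for a in advantages])
```

The two invalid calls carry the highest dense scores in this example, yet both land on the rejection value, and the continuous signal only ranks the three calls that passed the gate.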
There's also a calibration story underneath. Binary correctness rewards quietly degrade a model into confident guessing because they never penalize confident-but-wrong actions Does binary reward training hurt model calibration?, and asymmetric, utility-weighted losses can sharpen decisions while starving the underlying representation Can utility-weighted training loss actually harm model performance?. Both are cautionary tales about distorting the reward: the distortion propagates through every gradient. Shaping the advantage is gentler because it adjusts emphasis without rewriting what 'correct' means. Relatedly, scalar rewards simply can't carry everything a tool environment tells you — feedback splits into 'how well it went' and 'how it should change' Can scalar rewards capture all the information in agent feedback? — so overloading the reward channel with shaped bonuses tends to corrupt the evaluative signal you actually need.
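The calibration point can be illustrated with a small sketch. The source notes mention adding the Brier score as a second reward term; the exact form below (correctness minus squared error between stated confidence and correctness) is an assumption about how that combination might look, and both function names are hypothetical.

```python
def binary_reward(correct):
    """Plain correctness reward: confidence never enters the score."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness minus a Brier-style penalty on the stated confidence.

    confidence is the model's self-reported probability of being right,
    assumed to lie in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A wrong answer scores the same under the binary reward however confident
# the model was; the Brier-style term separates the two cases.
for conf in (0.1, 0.9):
    print(conf, binary_reward(False), round(calibrated_reward(False, conf), 4))
```

Under the binary reward a wrong answer costs the same whether the model was 10% or 90% sure, which is the incentive toward confident guessing described above; the second term makes the confident miss strictly worse.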
One consequence worth noticing: tool training has a phase structure. Models first master execution correctness, then hit a planning bottleneck Does RL training follow a predictable two-phase learning sequence?, and a fixed shaped reward that was tuned for phase one can become wrong in phase two. Advantage shaping, because it re-derives emphasis from the live distribution of rollouts each step, adapts to that shifting bottleneck instead of fighting it, which is arguably the deepest source of its stability.
Sources (6 notes)
DVAO weights objectives by their within-group variance, automatically up-weighting high-signal objectives and suppressing noise without hyperparameter tuning. This keeps advantage magnitudes bounded and replaces fixed scalarization constants with data-driven weighting.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Asymmetric loss functions push decisions in the intended direction but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with a symmetric loss and then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.