What deployment modes work best for trajectory-aware reward signals?
This explores how reward signals that pay attention to an agent's whole trajectory (not just its final answer) are best put to work — where the dense signal comes from, and how it's wired into training.
The corpus suggests the most reliable mode is to mine the signal from structure the agent already produces, rather than bolting on a separately trained process-reward model. Several methods converge here: tree-search rollouts compare sibling branches to turn a single outcome reward into step-level preferences Can tree structure alone convert outcome rewards into process supervision?, and a broader family (Tree-GRPO, Supervised RL, ToolPO) shows that tree topology, expert-aligned actions, and tool-call positions can each substitute for hand-annotated step rewards Can trajectory structure replace hand-annotated process rewards?. The agent's own internal state works too: taking the log-ratio of the model's successive probability estimates for the solution measures how much each turn shifts its belief, and that yields dense per-turn credit with no critic network at all Can an agent's own beliefs guide credit assignment without critics?.
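To make two of these structure-mined signals concrete, here is a minimal sketch, not the implementation from any of the cited methods. It assumes a rollout tree whose leaves carry an outcome reward, and a per-turn estimate of the probability the model assigns to the true solution. The names `Node`, `sibling_advantages`, and `belief_shift_credit` are hypothetical, and scoring a branch by its mean leaf reward minus the sibling mean is just one plausible way to compare siblings.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One step in a rollout tree. Only leaves carry an outcome reward."""
    name: str
    reward: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf beneath `node`."""
    if not node.children:
        return [float(node.reward)]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def sibling_advantages(parent: Node) -> Dict[str, float]:
    """Score each child by its mean leaf reward minus the mean over siblings."""
    values = {}
    for child in parent.children:
        rewards = leaf_rewards(child)
        values[child.name] = sum(rewards) / len(rewards)
    baseline = sum(values.values()) / len(values)
    return {name: value - baseline for name, value in values.items()}


def belief_shift_credit(p_solution: List[float]) -> List[float]:
    """Per-turn credit: log-ratio of successive probabilities of the solution."""
    return [math.log(after / before)
            for before, after in zip(p_solution, p_solution[1:])]


# Toy tree: two candidate first steps, each expanded into two finished rollouts.
root = Node("root", children=[
    Node("step_a", children=[Node("a1", reward=1.0), Node("a2", reward=1.0)]),
    Node("step_b", children=[Node("b1", reward=0.0), Node("b2", reward=1.0)]),
])
advantages = sibling_advantages(root)  # step_a ends up above step_b

# Toy belief trace: probability assigned to the true answer after each turn.
credit = belief_shift_credit([0.1, 0.2, 0.8])
```

In the toy tree only outcome rewards are supplied, yet each first step receives its own signed score. In the belief trace the per-turn credits sum to the log-ratio of the final and initial belief, so the total credit is simply redistributed across turns.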
The more interesting finding is that *how* you apply the signal matters as much as where it comes from. One recurring lesson: keep different kinds of judgment in different roles. When rubrics are used as gates that accept or reject whole rollouts — rather than being mashed into a dense numeric reward — they prevent reward hacking while still letting token-level rewards optimize inside the valid answers Can rubrics and dense rewards work together without hacking?. The same separation-of-levels logic shows up when a single statistic (cross-rollout variance) is used two ways at once: weighting tokens densely and filtering out degenerate queries Can one statistical measure serve dual purposes in RL training?.
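The separation of roles described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DRO's actual code: `scores[i][t]` is assumed to be some per-token score for token `t` under rollout `i`, the rubric verdicts are assumed to arrive as booleans, and the function names and the `min_mean_variance` threshold are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence


def gate_rollouts(rubric_pass: Sequence[bool],
                  token_rewards: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rubric as a gate: keep token-level rewards only for accepted rollouts."""
    return [list(r) for ok, r in zip(rubric_pass, token_rewards) if ok]


def token_variances(scores: Sequence[Sequence[float]]) -> List[float]:
    """Variance of each token's score across rollouts (one value per column)."""
    return [pvariance(column) for column in zip(*scores)]


def token_weights(scores: Sequence[Sequence[float]]) -> List[float]:
    """Dense use: weight each token by its share of the cross-rollout variance."""
    variances = token_variances(scores)
    total = sum(variances)
    if total == 0:
        return [0.0] * len(variances)
    return [v / total for v in variances]


def keep_query(scores: Sequence[Sequence[float]],
               min_mean_variance: float = 1e-4) -> bool:
    """Filtering use: drop queries whose rollouts barely differ from each other."""
    variances = token_variances(scores)
    return sum(variances) / len(variances) >= min_mean_variance


# The second rollout has the highest dense reward but fails the rubric,
# so it never reaches the optimizer.
kept = gate_rollouts([True, False, True],
                     [[0.1, 0.4], [0.9, 0.9], [0.3, 0.2]])

informative = [[0.9, 0.2], [0.9, 0.8]]   # rollouts disagree on the second token
degenerate = [[0.5, 0.5], [0.5, 0.5]]    # rollouts are indistinguishable
weights = token_weights(informative)
```

The point of the sketch is that the rubric never becomes a number the policy can climb, while the same variance statistic feeds both a token-level weight and a query-level keep-or-drop decision.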
There's also an asymmetry worth knowing: not every trajectory should be processed the same way. Treating successful episodes as concrete demonstrations and failures as abstracted lessons beats uniform consolidation and uses far less context Should successful and failed episodes be processed differently?. Pushing that further, training on negative trajectories alone, which suppresses wrong paths while preserving diversity, improves Pass@k across the range of k and often matches full PPO and GRPO, whereas positive-only reinforcement degrades Pass@k at larger k by concentrating probability mass Does negative reinforcement alone outperform full reinforcement learning?. So the 'best deployment mode' may depend on whether you're trying to teach good behavior or prune bad behavior.
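The teach-versus-prune distinction can be made concrete with a small sketch of how the three regimes differ in the per-sample weight on a policy-gradient term. It assumes binary correctness labels and unit-magnitude weights. The function names and mode strings are hypothetical, and a real trainer would operate on token log-probabilities in an autodiff framework rather than on plain floats.

```python
from typing import List, Sequence


def sample_weights(correct: Sequence[bool],
                   mode: str = "negative_only") -> List[float]:
    """Weight per sample: +1 reinforces it, -1 suppresses it, 0 ignores it."""
    table = {
        "full": (1.0, -1.0),           # reinforce correct, suppress wrong
        "positive_only": (1.0, 0.0),   # reinforce correct, ignore wrong
        "negative_only": (0.0, -1.0),  # ignore correct, suppress wrong
    }
    if mode not in table:
        raise ValueError(f"unknown mode: {mode}")
    w_correct, w_wrong = table[mode]
    return [w_correct if ok else w_wrong for ok in correct]


def surrogate_objective(logprobs: Sequence[float],
                        weights: Sequence[float]) -> float:
    """Mean of weight * log-prob; gradient ascent on it applies the weights."""
    return sum(w * lp for w, lp in zip(weights, logprobs)) / len(logprobs)


correct = [True, False, False]
logprobs = [-1.0, -2.0, -0.5]  # sequence log-probabilities under the policy

neg_weights = sample_weights(correct, "negative_only")
neg_objective = surrogate_objective(logprobs, neg_weights)
full_objective = surrogate_objective(logprobs, sample_weights(correct, "full"))
```

Under `negative_only` the correct sample contributes nothing, so the update only pushes probability away from the wrong answers and leaves the model free to redistribute it, which is consistent with the diversity-preserving behavior described above.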
Two cross-cutting cautions round this out. First, scalar rewards are lossy: agent feedback splits into an evaluative part (how good was this?) and a directive part (how should it change?), and a single number throws the directive half away, which argues for token-level distillation alongside the reward Can scalar rewards capture all the information in agent feedback?. Second, the shape of the reward leaks into the model's character: binary correctness rewards never penalize confident wrong answers, so they incentivize high-confidence guessing and hurt calibration, a flaw that adding a Brier-score term to the reward can fix Does binary reward training hurt model calibration?. If you want the reward itself to be smarter, letting reward models reason before they score raises their ceiling beyond outcome-only evaluation Can reward models benefit from reasoning before scoring?.
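The calibration point is easy to check numerically. The sketch below assumes the model reports a confidence in [0, 1] alongside its answer and that the Brier term is subtracted from the correctness reward with unit weight. That weighting and the function names are assumptions for illustration, not details confirmed by the source note.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: confidence is never examined."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus the Brier score of the stated confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with prob. p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


# A confident wrong answer is now punished harder than a hedged wrong answer,
# whereas the binary reward scores both as 0.
confident_wrong = calibrated_reward(False, 1.0)
hedged_wrong = calibrated_reward(False, 0.5)

# Search a grid of confidences for the one that maximizes expected reward
# when the answer is actually right 70% of the time.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.7, q))
```

Because the Brier score is a proper scoring rule, the expected reward peaks when the stated confidence equals the true probability of being right, so honest confidence is the best strategy and the correctness term still rewards accuracy.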
The through-line: trajectory-aware rewards deploy best when the dense signal is harvested from existing structure (tree branches, tool calls, belief shifts), when categorical and continuous judgments are kept in separate roles instead of collapsed together, and when success and failure are handled asymmetrically — not when you simply attach a heavier reward model to the same old outcome score.
Sources (10 notes)
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.