What information do next-state signals contain beyond what scalar rewards capture?
This explores what an agent's next-state feedback (what actually happened after it acted) carries beyond a scalar reward. The corpus has a surprisingly clean answer: a reward tells you *how well* you did, but the next state tells you *what to do about it*. One note splits agent feedback into two orthogonal channels, evaluative (a verdict on the action) and directive (which way to change it), and shows that scalar rewards capture only the first while discarding the second, which is exactly the part token-level distillation can recover (see: Can scalar rewards capture all the information in agent feedback?). That directive content is the throughline of everything else here.
The practical payoff is denser credit assignment. A sparse scalar reward typically arrives once, at the end of an episode, and says nothing about *where* things went wrong; rich next-state signals can be turned into per-step gradients. One method feeds the policy retrospective in-context evidence of its own mistakes so that it becomes its own process reward model, making external reward signals unnecessary (see: Can environment feedback replace scalar rewards in policy learning?). Another extracts dense per-turn credit from the agent's own shifting belief about the answer, with no critic network required (see: Can an agent's own beliefs guide credit assignment without critics?). A third treats the future states an agent reaches as direct supervision, learning from consequences with no external reward at all and matching expert-dependent baselines with half the data (see: Can agents learn from their own actions without external rewards?). And a family of methods mines the *structure* of trajectories (tree topology, tool-call positions, expert-aligned steps) to manufacture process rewards from sparse outcomes (see: Can trajectory structure replace hand-annotated process rewards?).
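The per-turn credit idea can be made concrete with a small sketch. The source summary for the belief-based method describes credit as log-ratios of sequential probability estimates; the function below is a minimal illustration of that description, not the method's actual implementation. The name `belief_delta_credit`, the `eps` clamp, and the example belief values are all assumptions made for illustration.

```python
import math

def belief_delta_credit(beliefs, eps=1e-9):
    """Per-turn credit as the log-ratio of consecutive belief estimates.

    beliefs[0] is the (hypothetical) probability the agent assigns to the
    correct answer before acting; beliefs[t] is that probability after turn t.
    Returns one credit value per turn.
    """
    credits = []
    for prev, curr in zip(beliefs, beliefs[1:]):
        # Clamp to avoid log(0) if a belief estimate collapses to zero.
        prev = max(prev, eps)
        curr = max(curr, eps)
        credits.append(math.log(curr / prev))
    return credits

# Illustrative belief trajectory over four turns (made-up numbers).
beliefs = [0.05, 0.10, 0.08, 0.40, 0.90]
credits = belief_delta_credit(beliefs)
for turn, credit in enumerate(credits, start=1):
    print(f"turn {turn}: credit {credit:+.3f}")
```

Because the log-ratios telescope, the per-turn credits sum to the log-ratio of the final belief to the initial one, so the dense signal stays consistent with the overall change in belief while still showing which turn moved it. In this example the second turn receives negative credit because the belief in the correct answer dropped.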
The sharpest framing comes from natural-language feedback: models stuck on a numerical-reward plateau start solving problems again once given a written critique explaining *why* they failed. That is direct evidence that scalar rewards are missing the "why" and the "how to fix it," not just the "how much" (see: Can natural language feedback overcome numerical reward plateaus?). This is the same gap the evaluative/directive split predicts, arriving from a different direction.
There's a deeper reason this matters, which the corpus surfaces almost as a warning. Several notes argue that reward-only RL (RLVR) doesn't expand what a model can do: it sharpens sampling toward solutions the base model already had, and can sometimes be activated by a single example or even by spurious rewards (see: What does reward learning actually do to model reasoning?; Does RLVR actually expand what models can reason about?). If scalar rewards can only re-weight existing behavior, then the *new* capability has to come from somewhere richer, and distillation of directive, token-level signal is exactly what's shown to transfer genuinely new reasoning patterns. The information next-state signals add isn't a nicety; it may be the part that actually teaches.
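The pass@k argument behind this claim can be sketched numerically. The function below is the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for c correct samples out of n; the per-problem counts are invented purely to illustrate the shape of the effect and are not results from any of the notes.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n attempts of which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 100
# Hypothetical counts of correct samples (out of N) on four problems.
base = [5, 5, 5, 5]         # rarely right, but occasionally right on every problem
sharpened = [90, 90, 0, 0]  # very reliable on two problems, never right on the rest

for k in (1, 8, 64):
    print(k, round(mean_pass_at_k(base, N, k), 3),
          round(mean_pass_at_k(sharpened, N, k), 3))
```

With these made-up counts the sharpened model wins clearly at k=1, but it can never exceed 0.5 because two problems have no correct samples at all, while the base model's average climbs toward 1 as k grows. That crossover is the pattern the notes describe as sharpening within an existing capability boundary.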
Worth a final twist: more reward structure isn't always the answer to extracting this information. One line of work finds that *negative* signal alone, suppressing wrong trajectories while preserving diversity, often matches full RL (see: Does negative reinforcement alone outperform full reinforcement learning?). Another argues that rubrics work better as accept/reject *gates* than as scores converted into dense rewards, because forcing rich categorical judgment into a scalar invites reward hacking (see: Can rubrics and dense rewards work together without hacking?). The lesson across all of it: the directive information is real and valuable, but it degrades the moment you flatten it back into a single number.
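One way to picture the gate-versus-reward distinction is the sketch below. It is a hypothetical reading of the description above, not the published method: a boolean rubric verdict decides whether a rollout group is used at all, and group-normalized advantages (mean-centered and divided by the standard deviation, in the style of GRPO) are computed only inside accepted groups. The function name, the data layout, and the reward values are all assumptions.

```python
from statistics import mean, pstdev

def gated_group_advantages(groups, eps=1e-6):
    """groups: list of (rubric_ok, rewards) pairs, one per rollout group.

    rubric_ok is a hypothetical boolean verdict from a rubric check;
    rewards holds the scalar rewards of the rollouts in that group.
    Returns normalized advantages for accepted groups only.
    """
    accepted = []
    for rubric_ok, rewards in groups:
        if not rubric_ok:
            # Rejected groups are dropped entirely. The rubric verdict is
            # never added to or mixed into the scalar reward.
            continue
        mu = mean(rewards)
        sigma = pstdev(rewards)
        accepted.append([(r - mu) / (sigma + eps) for r in rewards])
    return accepted

groups = [
    (True,  [0.2, 0.8, 0.5]),
    (False, [0.9, 0.95, 0.99]),  # high scalar reward, but fails the rubric
    (True,  [0.1, 0.1, 0.4]),
]
advantages = gated_group_advantages(groups)
print(len(advantages), "groups kept")
```

The design point is that the second group, which has the highest scalar rewards, contributes nothing. If the rubric score had instead been folded into the reward, a policy could trade rubric compliance against reward magnitude, which is the hacking route the gate is meant to close.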
Sources (10 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.