SYNTHESIS NOTE

Can scalar rewards capture all the information in agent feedback?

Explores whether numerical rewards alone can preserve both the evaluative judgment and the directional guidance embedded in natural feedback, or whether something crucial is lost in the conversion.

Synthesis note · 2026-04-07 · sourced from Autonomous Agents

The OpenClaw-RL framework makes a decomposition that was implicit in prior agentic RL work but never formalized: when an agent acts and the environment responds, the response carries two distinct kinds of information. The evaluative signal scores the action (how well did it perform?) and can be extracted as a scalar reward by a process reward model (PRM) judge. The directive signal specifies how the action should have been different: not just that it was wrong, but in what direction it should change. The two are orthogonal: high-quality directive information can accompany any evaluation, and reducing feedback to a scalar reward systematically discards the directive component.

Consider a user who says "you should have checked the file first." The evaluative content is approximately -1 (the response was inadequate). The directive content, by contrast, is specific down to the token level: check the file first. A PRM judge can convert the sentiment into a scalar, but the correction itself collapses into a single number. Similarly, a detailed software-engineering (SWE) error trace often implies a concrete correction direction that scalar outcome rewards cannot convey. Current methods for reinforcement learning from verifiable rewards (RLVR) operate on scalar rewards (Does RLVR actually expand what models can reason about?) and cannot convert directive information into a directional policy gradient. Distillation methods can process structured corrections, but they require pre-curated feedback-response pairs rather than live signals.
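The information loss described above can be made concrete with a toy sketch. Everything here is illustrative and not taken from OpenClaw-RL: `FeedbackSignal`, `toy_decompose`, and the keyword rule standing in for a PRM judge are hypothetical names and simplifications. The point is only that two feedback messages with different corrections can map to the same scalar.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeedbackSignal:
    """Hypothetical container for the two components of one feedback message."""
    evaluative: float          # scalar score, the part a PRM-style judge would emit
    directive: Optional[str]   # textual hint about how the action should change


# Toy keyword rule standing in for a PRM judge (an assumption, not a real model).
NEGATIVE_MARKERS = ("should have", "wrong", "failed", "error")


def toy_decompose(feedback: str) -> FeedbackSignal:
    """Split a feedback string into a scalar score and an optional hint."""
    text = feedback.lower()
    score = -1.0 if any(m in text for m in NEGATIVE_MARKERS) else 1.0
    hint = None
    marker = "should have "
    if marker in text:
        # Keep whatever follows "should have" as the directive hint.
        hint = text.split(marker, 1)[1].rstrip(". ")
    return FeedbackSignal(evaluative=score, directive=hint)


a = toy_decompose("You should have checked the file first.")
b = toy_decompose("You should have asked before deleting the branch.")

# Both messages reduce to the same scalar, yet they ask for different fixes.
print(a.evaluative == b.evaluative)  # the scalar cannot tell them apart
print(a.directive, "|", b.directive)
```

A learner that sees only `evaluative` receives identical training signal for both cases; the `directive` field is the part that a scalar-only pipeline has no channel for.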

OpenClaw-RL recovers the directive signal through Hindsight-Guided On-Policy Distillation (OPD): extract textual hints from the next state, construct an enhanced teacher context by injecting those hints, and distill token-level directional advantage back into the student policy. This is richer than any scalar reward because it teaches the model not just "that was wrong" but "here is what right looks like in these specific tokens." The empirical result — combining binary PRM-based RL with OPD via weighted loss yields significant gains over either alone — confirms the two signals are complementary, not redundant.
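One plausible reading of this recipe, sketched under stated assumptions: the token-level advantage is taken to be the gap between the hint-conditioned teacher's log-probability and the student's log-probability for each sampled token, and the weighted loss is a plain linear combination. The note does not specify either form, so the function names (`build_teacher_context`, `token_advantages`, `combined_loss`), the hint template, and the weights `w_rl` and `w_opd` are all hypothetical.

```python
from typing import List, Sequence


def build_teacher_context(prompt: str, hint: str) -> str:
    """Inject a hindsight hint into the prompt (hypothetical template)."""
    return f"{prompt}\n\n[Hint from next state] {hint}"


def token_advantages(teacher_logps: Sequence[float],
                     student_logps: Sequence[float]) -> List[float]:
    """Per-token gap between hint-conditioned teacher and student log-probs."""
    if len(teacher_logps) != len(student_logps):
        raise ValueError("teacher and student must score the same tokens")
    return [t - s for t, s in zip(teacher_logps, student_logps)]


def combined_loss(student_logps: Sequence[float],
                  teacher_logps: Sequence[float],
                  scalar_reward: float,
                  w_rl: float = 1.0,
                  w_opd: float = 1.0) -> float:
    """Value of an assumed weighted surrogate loss for one sampled response."""
    # Evaluative term: REINFORCE-style surrogate, one scalar shared by all tokens.
    rl_loss = -scalar_reward * sum(student_logps)
    # Directive term: each token is weighted by its own advantage. In an
    # autodiff framework the advantages would be treated as constants.
    adv = token_advantages(teacher_logps, student_logps)
    opd_loss = -sum(a * s for a, s in zip(adv, student_logps))
    return w_rl * rl_loss + w_opd * opd_loss


# Toy numbers: the teacher, having seen the hint, prefers the second token
# much more strongly than the student did.
student = [-1.0, -2.0]
teacher = [-1.0, -0.5]
advantages = token_advantages(teacher, student)
loss = combined_loss(student, teacher, scalar_reward=-1.0)
print(advantages, loss)
```

In this toy case the scalar term treats both tokens identically, while the directive term leaves the first token alone and concentrates all of its weight on the second, which is where the teacher and student disagree.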

This decomposition matters beyond OpenClaw-RL because it clarifies a conceptual muddle in agentic RL. When people debate "should we use outcome rewards or process rewards, scalar or verbal," the answer is usually "both, decomposed properly." The outcome-vs-process trade-off (Why do outcome-based reward models fail at intermediate step evaluation?) assumes a single signal type. The scalar-vs-verbal distinction (Can natural language feedback overcome numerical reward plateaus?) is usually treated as an architectural choice. OpenClaw-RL reframes both debates in terms of two projections of one feedback signal: evaluative (dense scalar) and directive (token-level).

The generalization: any learning loop that reduces natural feedback to scalars is discarding the fraction of training signal that most resembles supervised learning. A corrective sentence contains its own teacher.

Inquiring lines that read this note (157)

This note is a source for the research framings below, grouped by the broader line of inquiry each explores.

How do multi-agent systems achieve genuine cooperation and reasoning?

Why do reward structures fail to shape long-term agent learning?

How do self-generated feedback mechanisms enable effective model learning?

How do we evaluate AI systems when user perception misleads actual performance?

How can we distinguish genuine user preferences from measurement artifacts?

Can model confidence signals reliably improve reasoning quality and calibration?

What properties determine whether reward signals teach genuine reasoning?

Does RLHF training sacrifice accuracy and grounding for user agreement?

How does RLHF reward structure incentivize agreement over accuracy?

How should models express uncertainty rather than forced confident answers?

How can process reward models supervise complex reasoning traces?

Can alternative training methods improve on supervised fine-tuning for language models?

Can language model RL training avoid reward hacking and misalignment?

Can ensemble evaluation methods reduce bias more than single judges?

How should agents balance memory condensation to optimize context efficiency?

How do interface design choices shape consciousness attribution?

What are the ten intrinsic motivation heuristics that drive participation decisions?

How do aggregate reward models systematically exclude minority user preferences?

What constrains reinforcement learning's ability to expand model reasoning?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Can UCB-style bonuses over outcome space prevent policy entropy collapse?

How can conversational AI maintain consistent personas across conversations?

How does textual-only feedback limit what a persona can learn about users?

How should conversational agents balance goal-driven initiative with user control?

How does asymmetric information between users and agents relate to proactivity?

Can self-supervised signals enable process supervision without human annotation?

Can AI systems balance emotional competence with factual reliability?

Why does consolidated memory sometimes degrade agent performance?

How much actionable detail does condensation strip from raw experience?

Why do agents confidently report success despite actually failing tasks?

Does reinforcement learning teach reasoning or just when to reason?

How does reinforcement learning on outcomes reinforce template-matching rather than computation?

What makes weaker teacher models effective for stronger student training?

How can AI agents autonomously learn and transfer skills across tasks?

How do policy learning algorithm choices affect multi-objective optimization stability?

What determines success in training models on multiple tasks?

How should multi-objective post-training balance competing behavioral goals?

Does externalizing cognitive work and state improve agent reliability?

Why does externalizing bookkeeping raise effective feedback compute?

Why should disagreement be treated as signal in collaborative reasoning?

Why does consensus-seeking destroy information in normative but not factual tasks?

Related concepts in this collection (6)

Can agent deployment itself generate training signals automatically? Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
the framing this decomposition operates within
Can natural language feedback overcome numerical reward plateaus? Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
establishes that verbal feedback contains information scalars cannot reach
Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
another case where single-scalar objectives miss structure
Why do outcome-based reward models fail at intermediate step evaluation? Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
the outcome/process axis is the wrong cut; evaluative/directive is closer to the information structure
Does critiquing errors teach deeper understanding than imitating correct answers? Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
critique-based training as a cousin: teaching the model the directive structure behind errors
Does RLVR actually expand what models can reason about? Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
scalar RLVR's structural ceiling that directive signals may penetrate
