INQUIRING LINE

A reward score tells an AI how well it did — but the next state it lands in tells it how to do better.

What information do next-state signals contain beyond what scalar rewards capture?

This explores what an agent's next-state feedback (what actually happened after it acted) tells you that a single number — the reward — leaves out.

The corpus has a surprisingly clean answer: a reward tells you *how well* you did, but the next state tells you *what to do about it*. One note splits agent feedback into two orthogonal channels, evaluative (a verdict on the action) and directive (which way to change it), and shows that scalar rewards capture only the first and discard the second, which is exactly the part token-level distillation can recover (see "Can scalar rewards capture all the information in agent feedback?"). That directive content is the throughline of everything else here.

The practical payoff is denser credit assignment. A sparse outcome reward arrives once, at the end, and says nothing about *where* things went wrong; rich next-state signals can be turned into per-step gradients. One method feeds the policy retrospective in-context evidence of its own mistakes so that it acts as its own process reward model, making external reward signals unnecessary (see "Can environment feedback replace scalar rewards in policy learning?"). Another extracts dense per-turn credit from the agent's own shifting belief about the answer, with no critic network required (see "Can an agent's own beliefs guide credit assignment without critics?"). A third treats the future states an agent reaches as direct supervision, learning from consequences with no external reward at all and matching expert-dependent baselines with half the data (see "Can agents learn from their own actions without external rewards?"). And a family of methods mines the *structure* of trajectories (tree topology, tool-call positions, expert-aligned steps) to manufacture process rewards from sparse outcomes (see "Can trajectory structure replace hand-annotated process rewards?").

The sharpest framing comes from natural-language feedback: models stuck on a numerical-reward plateau start solving problems again once given a written critique explaining *why* they failed. That is direct evidence that scalar rewards are missing the "why" and the "how to fix it," not just the "how much" (see "Can natural language feedback overcome numerical reward plateaus?"). This is the same gap the evaluative/directive split predicts, arriving from a different direction.

There's a deeper reason this matters, which the corpus surfaces almost as a warning. Several notes argue that reward-only RL (RLVR) doesn't expand what a model can do. It sharpens sampling toward solutions the base model already had, and that sharpening can sometimes be triggered by a single training example or even by spurious rewards (see "What does reward learning actually do to model reasoning?" and "Does RLVR actually expand what models can reason about?"). If scalar rewards can only re-weight existing behavior, then *new* capability has to come from somewhere richer, and distillation, which supplies directive, token-level signal, is what these notes report as transferring genuinely new reasoning patterns. The information next-state signals add isn't a nicety; it may be the part that actually teaches.

One final twist: more reward structure isn't always the way to extract this information. One line of work finds that *negative* signal alone, which suppresses wrong trajectories while preserving diversity, often matches full PPO and GRPO (see "Does negative reinforcement alone outperform full reinforcement learning?"). Another argues that rubrics work better as accept/reject *gates* than as scores converted into dense rewards, because forcing rich categorical judgment into a scalar invites reward hacking (see "Can rubrics and dense rewards work together without hacking?"). The lesson across all of it: the directive information is real and valuable, but it degrades the moment you flatten it back into a single number.

Sources 10 notes

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
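
A minimal sketch of the self-teacher idea, not SDPO's actual objective: score the same sampled tokens twice, once under the plain policy and once under the same policy with the feedback in context, and read the per-token log-probability gap as a dense signal. The function name `per_token_signal` and the toy log-probabilities are hypothetical, chosen only for illustration.

```python
from typing import List


def per_token_signal(student_logps: List[float], teacher_logps: List[float]) -> List[float]:
    """Per-token log-ratio between the feedback-conditioned pass (teacher)
    and the plain pass (student) over the same sampled tokens.

    Positive values mark tokens the feedback makes more likely,
    negative values mark tokens it makes less likely.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("both passes must score the same token sequence")
    return [t - s for s, t in zip(student_logps, teacher_logps)]


# Hypothetical log-probs for a 4-token response (illustrative numbers only).
student = [-0.2, -1.5, -0.3, -2.0]   # policy without feedback in context
teacher = [-0.2, -3.0, -0.3, -0.5]   # same policy, feedback in context
signal = per_token_signal(student, teacher)
```

In this toy case a scalar reward would give one number for the whole response, while the log-ratio singles out the second token as the one to move away from and the fourth as the one to reinforce.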

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
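
One way to read "log-ratios of sequential probability estimates" is sketched below. It assumes the estimate is the probability the agent assigns to the correct answer after each turn; the exact quantity used by ΔBelief-RL is not specified here, and `belief_credits` and the belief values are hypothetical.

```python
import math
from typing import List


def belief_credits(beliefs: List[float]) -> List[float]:
    """Per-turn credit as log(p_t / p_{t-1}).

    beliefs[0] is the prior probability of the correct answer and
    beliefs[t] is the estimate after turn t. The credits telescope,
    so they sum to log(p_T / p_0).
    """
    if any(p <= 0.0 or p > 1.0 for p in beliefs):
        raise ValueError("beliefs must lie in (0, 1]")
    return [math.log(b / a) for a, b in zip(beliefs, beliefs[1:])]


# Hypothetical belief trajectory over four turns (illustrative numbers only).
beliefs = [0.05, 0.10, 0.08, 0.40, 0.90]
credits = belief_credits(beliefs)
```

A turn that lowers the belief (the second turn here) gets negative credit, and no critic network or process reward model is needed to compute any of it.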

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
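
The crossover this note describes can be reproduced with the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the cited analysis uses exactly this estimator is an assumption, and the per-problem counts below are made up purely to illustrate the shape of the effect.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generations (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts of correct samples out of N = 100 for two problems.
N = 100
base_counts = [20, 5]   # broad but unreliable: solves both problems occasionally
rlvr_counts = [90, 0]   # narrowed: reliable on one problem, never solves the other

for k in (1, 64):
    print(k, round(mean_pass_at_k(base_counts, N, k), 3),
          round(mean_pass_at_k(rlvr_counts, N, k), 3))
```

With these counts the narrowed model leads at k = 1 and the broad model leads at k = 64, which is the pattern the note attributes to RLVR concentrating probability on solutions the base model could already reach.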

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
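
A toy illustration of why suppressing a wrong answer can preserve diversity, assuming a softmax policy over a handful of discrete answers rather than token sequences. It is a sketch of the mechanism, not the training procedure from the cited work, and the logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def suppress(logits: List[float], wrong: int, lr: float = 1.0) -> List[float]:
    """One gradient-descent step on log p(wrong) for a softmax policy.

    d log p_w / d z_i = 1[i == w] - p_i, so the wrong answer's logit falls
    and every other logit rises in proportion to its current probability.
    """
    probs = softmax(logits)
    return [z - lr * ((1.0 if i == wrong else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]


logits = [2.0, 1.0, 0.0, -1.0]   # hypothetical policy; index 0 is the wrong answer
before = softmax(logits)
after = softmax(suppress(logits, wrong=0))
```

The update never names a preferred answer: mass leaves the wrong one and is shared among the alternatives, which keep their relative ordering. A positive-only update would instead pull mass toward one sampled answer.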

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
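
The separation between gating and scoring can be sketched as follows. This is an illustration under stated assumptions, not DRO's algorithm: the acceptance rule (a minimum rubric pass rate per group), the `Rollout` fields, and the mean-centered advantage are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    rubric_pass: bool      # categorical rubric verdict for this rollout
    dense_reward: float    # token-level reward summed over the response


def gated_advantages(group: List[Rollout], min_pass_rate: float = 0.5) -> List[float]:
    """Return mean-centered dense rewards if the group clears the rubric gate,
    or an empty list if the whole group is rejected.

    The rubric only decides accept/reject; it is never added to the reward.
    """
    if not group:
        return []
    pass_rate = sum(r.rubric_pass for r in group) / len(group)
    if pass_rate < min_pass_rate:
        return []  # rejected group: contributes no learning signal
    mean = sum(r.dense_reward for r in group) / len(group)
    return [r.dense_reward - mean for r in group]


accepted = gated_advantages([Rollout(True, 1.0), Rollout(True, 3.0)])
rejected = gated_advantages([Rollout(False, 9.0), Rollout(False, 8.0), Rollout(True, 1.0)])
```

The rejected group carries high dense rewards but fails the rubric, so it is dropped outright. Had the rubric verdict been folded into the scalar instead, those high dense rewards could have outweighed it, which is the hacking route the note warns about.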

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Intrinsic Credit Assignment for Long Horizon Interaction (3.40 match)
Reward Reasoning Model (3.32 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (2.63 match)
Reinforcement Learning via Self-Distillation (2.55 match)
OpenClaw-RL: Train Any Agent Simply by Talking (2.46 match)
Efficient Reinforcement Learning via Large Language Model-based Search (2.45 match)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (1.78 match)
Spurious Rewards: Rethinking Training Signals in RLVR (1.76 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, evaluate this still-open question: *What information do next-state signals contain beyond scalar rewards, and does extracting it actually expand model capability or only reshuffle existing capacity?*

What a curated library found — and when (dated claims, not current truth):
Findings span April 2025–February 2026. The library identifies:
• Reward signals are *evaluative* (verdict on action); next-state signals are *directive* (how to change) — orthogonal channels, scalar rewards discard the second [[2025-04]].
• Using the future states reached by an agent's own actions as supervision matches expert-dependent baselines with half the data; RLVR alone doesn't expand reasoning capacity beyond the base model's latent range [[2025-04, 2025-06]].
• Natural-language critique breaks numerical-reward plateaus by explaining *why* failure occurred, not just penalizing it [[2025-06]].
• Negative reinforcement (suppressing wrong trajectories) often matches full PPO and GRPO; forcing rubrics into scalars invites reward hacking [[2025-06]].
• Belief-shift toward target and trajectory-structure mining (tree topology, tool calls) yield dense intrinsic credit without external rewards [[2025-09, 2026-02]].

Anchor papers (verify; mind their dates):
• 2025-04: arXiv:2504.13837 — RLVR capability boundaries
• 2025-06: arXiv:2506.13351 — token-level reasoning + rubric gates
• 2025-06: arXiv:2506.03106 — natural-language + numerical feedback
• 2026-02: arXiv:2602.12342 — intrinsic credit assignment

Your task:
(1) RE-TEST EACH CONSTRAINT. For the evaluative/directive split and RLVR-doesn't-expand claims: has newer orchestration (multi-step reasoning, agentic frameworks, memory integration), training methods (DPO variants, KL penalties), or evals shown these limits *relaxed*? Does capability actually expand with directive signal in live reasoning tasks, or does it remain re-weighting? Separate durable insight (scalar rewards are lossy) from perishable claim (expansion impossible).
(2) Surface the strongest *disagreement* in the last 6 months: does any recent work argue directive signal is *overrated* relative to simple negative RL or accept/reject gates? Name the paper; does it hold up under re-testing?
(3) Propose 2 research questions assuming the regime *has* moved: (a) If directive signal *does* expand capability in live agents, what minimum richness of state observation is required to extract it reliably? (b) Can rubric-as-gate + sparse directive signal outperform dense rewards on out-of-distribution reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
