SYNTHESIS NOTE

Can an agent's own beliefs guide credit assignment without critics?

Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

Long-horizon RL suffers from sparse trajectory-level rewards. The standard fixes (process reward models trained on step-level annotations, external verifiers, LLM-as-judge) all require additional supervision infrastructure. PRMs need expensive step-level labels. Verifiers exist only for domains with automatically checkable answers (math, code). Judges introduce their own reward-modeling biases.

ΔBelief-RL (2602.12342) finds the credit signal inside the agent itself. At each interaction step, compute the agent's current probability assigned to the target solution. Compare it to the probability before the interaction. The log-ratio of sequential beliefs is the ΔBelief reward — a dense, turn-level signal that reinforces actions which shift the agent's internal world view toward the correct solution. Actions that increase belief in the target get rewarded; actions that don't, don't.
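
The log-ratio described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `delta_belief_rewards`, the example belief values, and the absence of any scaling, clipping, or discounting are all assumptions made here.

```python
import math
from typing import List


def delta_belief_rewards(target_probs: List[float]) -> List[float]:
    """Per-turn reward as the log-ratio of consecutive beliefs in the target.

    target_probs[0] is the belief before any interaction and target_probs[t]
    is the belief after turn t, so the result holds one reward per turn.
    (Hypothetical helper; the name and signature are not from the source.)
    """
    if any(p <= 0.0 or p > 1.0 for p in target_probs):
        raise ValueError("beliefs must be probabilities in (0, 1]")
    return [
        math.log(after / before)
        for before, after in zip(target_probs, target_probs[1:])
    ]


# Made-up beliefs in the target over three turns of interaction.
beliefs = [0.01, 0.05, 0.04, 0.40]
rewards = delta_belief_rewards(beliefs)
for turn, reward in enumerate(rewards, start=1):
    print(f"turn {turn}: reward = {reward:+.3f}")
```

The second turn lowers the belief in the target, so its reward is negative; the first and third turns raise it and are rewarded.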

The elegance is that no separate model is needed. The agent's own log-probabilities on the correct outcome are the value signal. There is no critic to train, no PRM to maintain, no judge to query. The one added step is measuring log-probabilities on the target, which is relatively inexpensive: a single forward pass per turn.
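
One plausible way to measure the belief is the teacher-forced log-likelihood of the target's tokens given the dialogue so far. The sketch below assumes that reading; the note only says the log-probabilities on the target come from a single forward pass. The helper names, the three-token vocabulary, and the logits are hypothetical, and plain Python lists stand in for the model's output.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def target_log_prob(
    step_logits: Sequence[Sequence[float]], target_ids: Sequence[int]
) -> float:
    """Sum of per-token log-probabilities of the target sequence.

    step_logits[i] are the logits at the position that predicts target_ids[i],
    as a single teacher-forced forward pass would produce them.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return sum(
        log_softmax(logits)[token]
        for logits, token in zip(step_logits, target_ids)
    )


# Toy vocabulary of 3 tokens; the target answer is the token sequence [1, 2].
target = [1, 2]
before_turn = target_log_prob([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], target)
after_turn = target_log_prob([[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]], target)
print(f"log-belief before: {before_turn:.3f}, after: {after_turn:.3f}")
```

With a real language model, the logits would come from scoring the target continuation appended to the current conversation context.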

Two properties make this work. First, it is general-purpose: it applies to any task where the correct final outcome is available during training, which covers most supervised settings. Second, it is robust to over-optimization. PRMs can be exploited because their reward signal is a learned approximation. ΔBelief is grounded in the model's own evolving probability assignment, which is harder to game: the only way to increase the log-probability of the target is to actually integrate information that supports it.

Empirically, training with ΔBelief-RL on 20Qs yields CIA models at the 1.7B-4B scale that outperform prior state-of-the-art multi-turn methods and even 670B models. Performance generalizes to interaction horizons longer than those seen in training and to out-of-distribution applications (customer service, personalization).

The mechanism aligns with the note "Can conversations themselves personalize without user profiles?": both reward uncertainty reduction. But ΔBelief's signal is about the target's probability specifically, while curiosity reward is about general uncertainty over user type. ΔBelief is information-theoretically tighter: it rewards moves toward the actual answer, not all moves that increase clarity.
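
The difference between a target-specific signal and a general uncertainty-reduction signal can be made concrete with a toy comparison. This sketch rests on loud assumptions: entropy reduction is used as a generic stand-in for an uncertainty-based reward (it is not the formulation of the curiosity-reward note, which concerns user type), both signals are placed on the same made-up hypothesis set for comparison, and all names and numbers are hypothetical.

```python
import math
from typing import Dict

Belief = Dict[str, float]


def entropy(dist: Belief) -> float:
    """Shannon entropy in nats of a discrete belief distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def delta_belief(before: Belief, after: Belief, target: str) -> float:
    """Log-ratio of the probability assigned to the known target."""
    return math.log(after[target] / before[target])


def entropy_reduction(before: Belief, after: Belief) -> float:
    """Generic uncertainty-reduction signal (stand-in, see lead-in)."""
    return entropy(before) - entropy(after)


target = "cat"
before = {"cat": 0.25, "dog": 0.25, "fox": 0.25, "owl": 0.25}
helpful = {"cat": 0.85, "dog": 0.05, "fox": 0.05, "owl": 0.05}
misled = {"cat": 0.05, "dog": 0.85, "fox": 0.05, "owl": 0.05}

for name, after in [("helpful", helpful), ("misled", misled)]:
    print(
        f"{name}: entropy reduction = {entropy_reduction(before, after):+.3f}, "
        f"delta belief = {delta_belief(before, after, target):+.3f}"
    )
```

Both updates sharpen the belief by the same amount, so the entropy-based signal cannot tell them apart. Only the target-specific log-ratio penalizes the update that becomes confident in the wrong answer.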

The broader implication: in any setting where the ground-truth final outcome is available during training, the model's own probability shift can serve as a dense intrinsic reward. The reward model is not load-bearing.
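
A short derivation shows why a log-ratio of sequential beliefs behaves as a dense decomposition of an outcome-level quantity. The notation is an assumption introduced here, not taken from the paper: $b_t$ is the probability the agent assigns to the target after turn $t$, $b_0$ is the belief before any interaction, and rewards are summed without discounting.

```latex
r_t = \log \frac{b_t}{b_{t-1}},
\qquad
\sum_{t=1}^{T} r_t
  = \sum_{t=1}^{T} \bigl( \log b_t - \log b_{t-1} \bigr)
  = \log b_T - \log b_0 .
```

Under these assumptions the per-turn rewards telescope: their sum equals the total gain in log-belief over the episode, so the dense signal spreads that total across the turns that produced it. Whether the method adds discounting or combines this with an outcome reward is not stated in this note.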

Inquiring lines that read this note 66

How do self-generated feedback mechanisms enable effective model learning?

How do we evaluate AI systems when user perception misleads actual performance?

Why do reward structures fail to shape long-term agent learning?

How should models express uncertainty rather than forced confident answers?

How do multi-agent systems achieve genuine cooperation and reasoning?

How do agents ground their judgments in evidence instead of pattern matching?

Can model confidence signals reliably improve reasoning quality and calibration?

What properties determine whether reward signals teach genuine reasoning?

How can process reward models supervise complex reasoning traces?

How do interface design choices shape consciousness attribution?

What are the ten intrinsic motivation heuristics that drive participation decisions?

What constrains reinforcement learning's ability to expand model reasoning?

Can language model RL training avoid reward hacking and misalignment?

Can self-supervised signals enable process supervision without human annotation?

Can self-supervised methods replace human annotations for process reward models?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Can UCB-style bonuses over outcome space prevent policy entropy collapse?

How should conversational agents balance goal-driven initiative with user control?

How do agents decide when to abstain from contributing?

How can conversational AI maintain consistent personas across conversations?

Can multi-turn reinforcement learning actually solve persona drift without addressing the default bias?

What makes weaker teacher models effective for stronger student training?

How does information asymmetry between teacher and student create the learning signal?

How do aggregate reward models systematically exclude minority user preferences?

What memory architectures best support persistent reasoning across extended interactions?

What triggers control processes to act on stored preference knowledge?

Is model self-awareness based on genuine introspection or pattern matching?

Can models detect when their own trajectory is on-policy versus off-policy?

Does reinforcement learning teach reasoning or just when to reason?

How can verifier-free reinforcement learning handle reasoning without task-specific checks?

Why do agents confidently report success despite actually failing tasks?

How does poor belief tracking cause agents to keep acting past the point of usefulness?

How do policy learning algorithm choices affect multi-objective optimization stability?

Can on-policy optimization variants avoid the probability squeezing problem?

Can alternative training methods improve on supervised fine-tuning for language models?

Can light human signals steer already-learned behavior without preference labels?

Related concepts in this collection 4

Can conversations themselves personalize without user profiles?
Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?
Relation: both reward uncertainty reduction; ΔBelief is target-specific, curiosity reward is type-general (different information-theoretic targets).

Can we reward reasoning steps without human annotation?
Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
Relation: L2T uses PAC-Bayes/Fisher; ΔBelief uses log-ratio of sequential beliefs; both convert outcome correctness into dense step-level reward without annotation.

Can environment feedback replace scalar rewards in policy learning?
Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.
Relation: convergent verifier-free move via different mechanism: SDPO uses feedback-conditioned self-teacher; ΔBelief uses belief-shift on target.

Can reward models learn by comparing policies instead of judging them?
What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
Relation: three independent paths to RL without external preference labels are converging.
