Can an agent's own beliefs guide credit assignment without critics?
Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.
Long-horizon RL suffers from sparse trajectory-level rewards. The standard fixes — process reward models trained on step-level annotations, external verifiers, LLM-as-judge — all require additional supervision infrastructure. PRMs need expensive step-level labels. Verifiers exist only for verifiable domains (math, code). Judges introduce their own reward-modeling biases.
ΔBelief-RL (2602.12342) finds the credit signal inside the agent itself. At each interaction step, compute the agent's current probability assigned to the target solution. Compare it to the probability before the interaction. The log-ratio of sequential beliefs is the ΔBelief reward — a dense, turn-level signal that reinforces actions which shift the agent's internal world view toward the correct solution. Actions that increase belief in the target get rewarded; actions that don't, don't.
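The log-ratio described above can be sketched in a few lines. This is a minimal illustration, assuming the per-turn log-probabilities of the target have already been measured; the function name `delta_belief_rewards` and the example belief values are hypothetical and not taken from the paper.

```python
import math
from typing import List


def delta_belief_rewards(target_logprobs: List[float]) -> List[float]:
    """Per-turn rewards from a sequence of target log-probabilities.

    target_logprobs[0] is the belief before any interaction;
    target_logprobs[t] is the belief after turn t.
    Each reward is log p_t(target) - log p_{t-1}(target).
    """
    if len(target_logprobs) < 2:
        return []
    return [
        after - before
        for before, after in zip(target_logprobs, target_logprobs[1:])
    ]


# Hypothetical beliefs in the target over a 3-turn episode: 1% -> 5% -> 4% -> 40%.
beliefs = [0.01, 0.05, 0.04, 0.40]
rewards = delta_belief_rewards([math.log(p) for p in beliefs])
for turn, r in enumerate(rewards, start=1):
    print(f"turn {turn}: reward = {r:+.3f}")
```

Under this plain log-ratio, the second turn (belief falling from 5% to 4%) receives a negative value. Whether the actual method clips or rescales such turns is not stated in this note.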
The elegance is that no separate model is needed. The agent's own log-probabilities on the correct outcome are the value signal. There is no critic to train, no PRM to maintain, no judge to query. The only added cost is measuring the log-probability of the target, a single forward pass per turn, which is relatively inexpensive.
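To make the "single forward pass" concrete, here is a sketch of scoring a target sequence from the logits a model would return under teacher forcing. It assumes the target's probability is the product of its per-token probabilities; how the paper actually elicits and scores the target is not specified in this note. The helper names and toy logits are hypothetical, and no real model is called.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def target_logprob(
    step_logits: Sequence[Sequence[float]], target_ids: Sequence[int]
) -> float:
    """Sum of log p(target token | prefix) across the target's tokens.

    step_logits[i] holds the logits at the position that predicts
    target_ids[i], as produced by one teacher-forced forward pass.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    return sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, target_ids)
    )


# Toy 3-token vocabulary and 2-token target; the logits are made up.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_target = [0, 1]
print(round(target_logprob(toy_logits, toy_target), 4))
```

Calling `target_logprob` once before and once after a turn gives the two quantities whose difference is the turn's reward.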
Two properties make this work. First, it is general-purpose: it applies to any task where the correct final outcome is available during training (which covers most supervised settings). Second, it is robust to over-optimization: PRMs can be exploited because their reward signal is a learned approximation, whereas ΔBelief is grounded in the model's own evolving probability assignment. That is harder to game, because the only way to increase the log-probability of the target is to actually integrate information that supports it.
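One way to see why the signal is hard to inflate is to sum it over an episode. The derivation below assumes the reward is exactly the log-ratio of sequential beliefs with no additional shaping terms; the symbols b_t (belief after turn t), y* (target) and r_t (turn reward) are notation introduced here for illustration.

```latex
r_t = \log \frac{b_t(y^*)}{b_{t-1}(y^*)}, \qquad
\sum_{t=1}^{T} r_t = \log b_T(y^*) - \log b_0(y^*) \le -\log b_0(y^*)
```

The sum telescopes, so the total reward depends only on the initial and final beliefs, and it is capped because b_T(y*) cannot exceed 1. Belief that is raised on one turn and lost on the next nets to zero, so back-and-forth cannot accumulate reward.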
Empirically, ΔBelief-RL trained on 20Qs yields CIA models at the 1.7B-4B scale that outperform prior SOTA multi-turn methods and even 670B models. Performance generalizes to interaction horizons longer than those seen in training and to out-of-distribution (OOD) applications (customer service, personalization).
The mechanism aligns with the note "Can conversations themselves personalize without user profiles?": both reward uncertainty reduction. But ΔBelief's signal concerns the target's probability specifically, while the curiosity reward concerns general uncertainty over user type. ΔBelief is information-theoretically tighter: it rewards moves toward the actual answer, not every move that increases clarity.
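The difference between the two signals shows up in a toy case where the agent becomes more certain, but about the wrong answer. The sketch below assumes the general-uncertainty reward can be modelled as entropy reduction over a belief distribution; that is a simplification made here, not the linked note's stated formulation. The candidate names and probabilities are invented.

```python
import math
from typing import Dict


def entropy(belief: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a belief over candidate answers."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)


def entropy_reduction(before: Dict[str, float], after: Dict[str, float]) -> float:
    """General uncertainty reduction: positive when the belief sharpens."""
    return entropy(before) - entropy(after)


def delta_belief(before: Dict[str, float], after: Dict[str, float], target: str) -> float:
    """Target-specific signal: log-ratio of belief in the correct answer."""
    return math.log(after[target]) - math.log(before[target])


before = {"cat": 0.25, "dog": 0.25, "owl": 0.25, "eel": 0.25}
# The agent becomes confident, but in the wrong candidate ("dog").
after = {"cat": 0.05, "dog": 0.85, "owl": 0.05, "eel": 0.05}

print(entropy_reduction(before, after) > 0)    # uncertainty fell
print(delta_belief(before, after, "cat") < 0)  # belief in the true answer fell
```

With "cat" as the true answer, the entropy-based signal is positive while the target-specific signal is negative, which is the sense in which the latter is tighter.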
The broader implication: in any setting where the ground-truth final outcome is available to the model during training, the model's own probability shift can serve as a dense intrinsic reward. A separate reward model is not load-bearing.
Inquiring lines that use this note as a source (60)
This note is a source for these synthesized inquiries.
- Does self-conditioning improve belief-behavior alignment better than external priors?
- Can systems recognize and abstain on judgments rather than hallucinating preferences?
- How does credit assignment drive agents to write information into environments?
- Why do weak belief tracking and conservative actions trap agents in low-information states?
- How do agents ground their judgments in evidence instead of pattern matching?
- How do we assign confidence and polarity scores to belief edges?
- Can reward model training be automated without changing feedback mechanisms?
- What information do next-state signals contain beyond what scalar rewards capture?
- Do outcome-only reward signals miss step-level errors that compound later?
- What makes process-level supervision better than outcome-only reward signals?
- What are the ten intrinsic motivation heuristics that drive participation decisions?
- Can subjective tasks be delegated without human feedback loops?
- How do probability-based rewards compare to self-consistency as training signals for reasoning?
- How does reward model training permit spurious correlations in scoring?
- How does self-consistency compare to confidence as a proxy reward signal?
- Can intrinsic reward signals extend beyond mathematics to medicine and law?
- What distinguishes verifiable rewards from preference-based rewards in unified training?
- Is elaborate reward shaping necessary if the pretrained prior already contains good solutions?
- How do process-level rewards compare to environment-extracted next-state signals?
- Can self-supervised methods replace human annotations for process reward models?
- Can UCB-style bonuses over outcome space prevent policy entropy collapse?
- Can model confidence signals replace explicit external reward functions?
- What information-theoretic framework explains why process rewards beat outcome only?
- Can agents revise their beliefs predictably when presented with interventions?
- How does credit assignment work across many sequential decision steps in language models?
- Why do agents fail to internalize value from informative observations?
- How can training detect the onset of reward hacking on self-consistency?
- How does temporal anchoring maintain the learning signal in self-rewarding loops?
- How do agents decide when to abstain from contributing?
- Does common ground alignment require explicit rewards to emerge?
- Can multi-turn reinforcement learning actually solve persona drift without addressing the default bias?
- Can AI learn intrinsic motivation to assess its own relevance?
- What deployment modes work best for trajectory-aware reward signals?
- How does information asymmetry between teacher and student create the learning signal?
- When does outcome reward signal become informative during model training?
- How does belief-shift reward compare to curiosity-driven and process reward approaches?
- Can log-probability ratios resist reward hacking better than learned PRM signals?
- Why does belief-shift reward enable smaller models to match larger baselines?
- Can an agent's internal probabilities serve as value signals across domains?
- Does belief-shift credit assignment generalize to tasks without ground-truth outcomes?
- Can binary judge feedback replace external reward signals for skill learning?
- Why does self-judgment of success or failure work without ground truth labels?
- Why does self-segmentation into chunks-of-thought matter for reward models?
- How do reward models as policy discriminators differ from labeled preferences?
- What happens when variance in reward signals comes from a noisy model?
- How does credit assignment across objectives differ from credit assignment across time?
- What triggers control processes to act on stored preference knowledge?
- Can models detect when their own trajectory is on-policy versus off-policy?
- Can verifiable rewards during pretraining replace costly human preference labeling?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
- Are different reward signal sources substitutable in verifier-free RL?
- Can early experience replace external rewards as a learning signal?
- Can structured rewards still teach models when spurious rewards also work?
- What makes reward models fundamentally different from policy discriminators?
- How does belief-shift credit assignment compare to process reward models?
- Does pairwise self-judgment avoid reward model scaling problems?
- How do pairwise self-judgment and internal belief-shift replace verification differently?
- What makes reward signal sources substitutable across verifier-free RL patterns?
- Does the generation-verification gap define where self-rewarding actually works?
- Can agents escape weak belief tracking and conservative action selection traps?
Related concepts in this collection (4)
- Can conversations themselves personalize without user profiles?
  Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?
  both reward uncertainty reduction; ΔBelief is target-specific, curiosity reward is type-general (different information-theoretic targets)
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  L2T uses PAC-Bayes/Fisher; ΔBelief uses log-ratio of sequential beliefs; both convert outcome correctness into dense step-level reward without annotation
- Can environment feedback replace scalar rewards in policy learning?
  Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.
  convergent verifier-free move via different mechanism: SDPO uses feedback-conditioned self-teacher; ΔBelief uses belief-shift on target
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  three independent paths to RL without external preference labels are converging
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Intrinsic Credit Assignment for Long Horizon Interaction
- Learning to Reason without External Rewards
- Can Large Reasoning Models Self-Train?
- Reinforcement Learning via Self-Distillation
- Reward Reasoning Model
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- RLPR: Extrapolating RLVR to General Domains without Verifiers
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
Original note title
belief-shift toward the target solution is a dense intrinsic reward — log-ratio of sequential beliefs provides per-turn credit without separate critic or PRM