How does belief-shift reward compare to curiosity-driven and process reward approaches?
This explores how belief-shift reward (an agent using changes in its own confidence as an intrinsic learning signal) compares to two other ways of generating a learning signal: curiosity-driven exploration and process rewards that score reasoning step by step. The short version: the corpus is rich on belief-shift and process rewards, and treats them as part of a larger shift away from external reward models. It does not hold a paper on curiosity-driven reward specifically, so that leg of the comparison is the thin one.
Belief-shift reward works by watching how an agent's own probability estimate of the right answer moves turn by turn; that movement *is* the reward, with no critic network or separate scorer required (see "Can an agent's own beliefs guide credit assignment without critics?"). The striking framing comes from a synthesis note arguing that the late-2025 RL literature is independently converging on three patterns, each of which replaces a different component of the old RLHF pipeline: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces the explicit reward signal (see "Can language models replace reward models with internal signals?"). So belief-shift is less a rival to process rewards than a method aimed at a *different component* of that pipeline: it replaces the critic, not the step-scorer.
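To make the mechanism concrete, here is a minimal sketch of a per-turn belief-shift reward computed as the log-ratio of consecutive probability estimates. The function name `belief_shift_rewards`, the clipping constant `eps`, and the example belief values are all hypothetical, chosen for illustration; the exact formulation used in ΔBelief-RL may differ.

```python
import math

def belief_shift_rewards(beliefs, eps=1e-8):
    """Return one reward per turn: log(p_t / p_{t-1}).

    `beliefs` is the agent's own probability of the correct answer
    after each turn (index 0 is the belief before the first turn).
    `eps` is an assumed floor that keeps the logarithm finite.
    """
    rewards = []
    for prev, curr in zip(beliefs, beliefs[1:]):
        prev = min(max(prev, eps), 1.0)
        curr = min(max(curr, eps), 1.0)
        rewards.append(math.log(curr / prev))
    return rewards

# Hypothetical belief trajectory over four turns of a guessing game.
beliefs = [0.05, 0.10, 0.40, 0.20, 0.90]
rewards = belief_shift_rewards(beliefs)

for turn, r in enumerate(rewards, start=1):
    print(f"turn {turn}: reward {r:+.3f}")
```

Under this formulation the per-turn rewards telescope: their sum equals the log-ratio of the final belief to the initial one, so the total episode gain is split across turns according to where the belief actually moved. A turn that lowers confidence (the third turn here) receives a negative reward without any critic being consulted.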
Process rewards attack the problem from the opposite end: instead of one number at the end, they score the reasoning along the way. The interesting twist is that you may not need to build a process reward model at all. Tree-structured rollouts can manufacture step-level signals by comparing sibling branches of a search tree, turning a single outcome reward into process supervision without a separate process reward model or step-level annotation (see "Can tree structure alone convert outcome rewards into process supervision?"). And when you do train a judge, making it *reason about* the reasoning (a generative judge) beats a classifier-style scorer while using far less training data (see "Can judges that reason about reasoning outperform classifier rewards?"). Belief-shift and these process methods share a goal: a denser, cheaper signal than a sparse final reward.
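The sibling-comparison idea can be sketched in a few lines. This is an illustrative toy, not Tree-GRPO's actual algorithm: the `Node` class, the choice to value a subtree by the mean of its children's values, and the example tree are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional

@dataclass
class Node:
    step: str                       # label for this reasoning step
    reward: Optional[float] = None  # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)

def subtree_value(node):
    """Leaf: its outcome reward. Internal node: mean of its children's values."""
    if not node.children:
        return float(node.reward)
    values = [subtree_value(child) for child in node.children]
    return sum(values) / len(values)

def sibling_preferences(root):
    """Return (preferred_step, rejected_step) pairs for siblings whose values differ."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        for a, b in combinations(node.children, 2):
            va, vb = subtree_value(a), subtree_value(b)
            if va > vb:
                pairs.append((a.step, b.step))
            elif vb > va:
                pairs.append((b.step, a.step))
        stack.extend(node.children)
    return pairs

# Hypothetical rollout tree: only the leaves carry an outcome reward.
tree = Node("question", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])
preferences = sibling_preferences(tree)
print(preferences)
```

In this toy tree, step A is preferred over step B because one of its continuations succeeded, and A1 is preferred over A2 one level down. The tied siblings B1 and B2 yield no pair. Every preference is derived from leaf-level outcome rewards alone, which is the sense in which the tree structure supplies step-level supervision.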
The deeper question lurking under all three is *what a reward signal can even carry*. One note argues that agent feedback splits into two orthogonal channels, evaluative ('how good was that?') and directive ('what should change?'), and that a scalar reward captures only the first (see "Can scalar rewards capture all the information in agent feedback?"). That is why models plateau on numerical rewards but break through when handed a natural-language critique explaining *why* they failed (see "Can natural language feedback overcome numerical reward plateaus?"). Belief-shift and process rewards are both still essentially evaluative: richer in *when* the signal arrives, not in *what kind* of information it is.
There is also reason to be skeptical that any of these reward schemes teach genuinely new abilities. Studies of verifiable-reward RL (RLVR) find that it sharpens *sampling* toward solutions the base model could already reach rather than expanding the reasoning frontier: for models with appropriate pretraining, spurious rewards work nearly as well as correct ones, and base models can beat RL-trained ones at high sample counts (see "What does reward learning actually do to model reasoning?" and "Does RLVR actually expand what models can reason about?"). If that holds across reward types, then the belief-shift-vs-process debate is less about which method learns more and more about which extracts existing capability most cheaply, which is exactly where belief-shift's no-extra-models design looks strongest.
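The "base models win at high sample counts" finding rests on pass@k comparisons. A sketch of how such a crossover can arise is below, using the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are invented purely for illustration and are not results from any of the cited notes; `mean_pass_at_k` is a hypothetical helper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Invented counts out of n = 100 samples on two problems.
# The base model solves both problems occasionally; the RL-tuned model
# solves the first almost always and never solves the second.
n = 100
base_counts = [30, 2]
rl_counts = [90, 0]

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, n, k)
    rl = mean_pass_at_k(rl_counts, n, k)
    print(f"k={k:>3}: base {base:.3f}  rl {rl:.3f}")
```

With these made-up counts the RL-tuned model leads at k = 1 because its sampling is concentrated on a solution it finds reliably, while the base model leads at k = 100 because it occasionally solves the problem the tuned model has lost. That is the shape of the pattern the pass@k analyses describe: narrower sampling, not a larger set of solvable problems.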
Sources (8 notes)
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.