INQUIRING LINE

When an AI learns by watching its own confidence shift, does that beat rewarding it step-by-step for good reasoning?

How does belief-shift reward compare to curiosity-driven and process reward approaches?

This explores how belief-shift reward (an agent using changes in its own confidence as an intrinsic learning signal) stacks up against curiosity-driven exploration and process-reward methods that score reasoning step-by-step.

The short version: the corpus is rich on belief-shift and process rewards and treats both as part of a larger shift away from external reward models. It holds no paper specifically on curiosity-driven reward, so that leg of the comparison is the thin one.

Belief-shift reward works by watching how an agent's own probability estimate of the right answer moves turn by turn; that movement *is* the reward, with no critic network or separate scorer required (see "Can an agent's own beliefs guide credit assignment without critics?"). The striking framing comes from a synthesis note arguing that late-2025 RL is quietly converging on three parallel ways to drop the external reward machinery: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces the reward signal itself (see "Can language models replace reward models with internal signals?"). So belief-shift is less a rival to process rewards than a fix for a *different component* of the old pipeline: it kills the critic, not the step-scorer.
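The per-turn signal described above can be sketched in a few lines. This is a minimal illustration, assuming the reward is the log-ratio of consecutive probability estimates for the correct answer, as the ΔBelief-RL source note describes. The function name `belief_shift_rewards`, the `eps` smoothing term, and the example probabilities are hypothetical and not taken from the paper.

```python
import math

def belief_shift_rewards(probs, eps=1e-8):
    """Per-turn reward: log-ratio of consecutive beliefs in the correct answer.

    probs[t] is the agent's own probability estimate for the right answer
    after turn t (probs[0] is the belief before any turn is taken).
    eps is an assumed smoothing term that guards against log(0).
    """
    rewards = []
    for prev, curr in zip(probs, probs[1:]):
        rewards.append(math.log((curr + eps) / (prev + eps)))
    return rewards

# Hypothetical belief trajectory over three turns of a guessing game.
beliefs = [0.05, 0.10, 0.40, 0.80]
rewards = belief_shift_rewards(beliefs)

for turn, r in enumerate(rewards, start=1):
    print(f"turn {turn}: reward {r:+.3f}")
```

Because the log-ratios telescope, the per-turn rewards sum to the log-ratio of the final and initial beliefs. The dense signal redistributes the same total credit across turns; it does not create additional reward.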

Process rewards attack the problem from the opposite end: instead of one number at the end, they score the reasoning along the way. The interesting twist is that you may not need to build a process reward model at all. Tree-structured rollouts can manufacture step-level signals just by comparing sibling branches of a search tree, turning a single outcome reward into process supervision with no separate model or step-level annotation (see "Can tree structure alone convert outcome rewards into process supervision?"). And when you do train a judge, making it *reason about* the reasoning (a generative judge) beats a classifier-style scorer while using far less training data (see "Can judges that reason about reasoning outperform classifier rewards?"). Belief-shift and these process methods share a goal: a denser, cheaper signal than a sparse final reward.
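The sibling-comparison idea can be illustrated with a toy tree. This is a sketch under stated assumptions, not Tree-GRPO's actual implementation: it assumes each leaf carries one outcome reward, that a branch's value is the mean reward of the leaves beneath it, and that a step's signal is its value minus the mean over its siblings. The `Node` class, the function names, and the example tree are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional

@dataclass
class Node:
    step: str                          # label for the reasoning step
    reward: Optional[float] = None     # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)

def value(node: Node) -> float:
    """Mean outcome reward over the subtree rooted at this node."""
    if not node.children:
        return node.reward
    return mean([value(child) for child in node.children])

def sibling_advantages(node: Node, out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Step-level signal: each child's value minus the mean of its siblings."""
    if out is None:
        out = {}
    if node.children:
        values = [value(child) for child in node.children]
        baseline = mean(values)
        for child, v in zip(node.children, values):
            out[child.step] = v - baseline
            sibling_advantages(child, out)
    return out

# Hypothetical rollout tree: only the leaves were scored by the outcome reward.
tree = Node("root", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])

adv = sibling_advantages(tree)
print(adv)
```

In this toy tree, step A scores above step B only because one of its continuations succeeded. No step was ever labeled directly, which is the sense in which branching structure alone yields process-level supervision.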

The deeper question lurking under all three is *what a reward signal can even carry*. One note argues that agent feedback splits into two orthogonal channels, evaluative ('how good was that?') and directive ('what should change?'), and that a scalar reward captures only the first (see "Can scalar rewards capture all the information in agent feedback?"). That's why models plateau on numerical rewards but break through when handed a natural-language critique explaining *why* they failed (see "Can natural language feedback overcome numerical reward plateaus?"). Belief-shift and process rewards are both still essentially evaluative: they are richer in *when* the signal arrives, not in *what kind* of information it carries.

Worth knowing for the curious reader: there is reason to doubt that any of these reward schemes teach genuinely new abilities. Studies of verifiable-reward RL (RLVR) find that it sharpens *sampling* toward solutions the base model could already reach rather than expanding the reasoning frontier: spurious rewards work nearly as well as correct ones for suitably pretrained models, and base models can beat RL-trained ones at high sample counts (see "What does reward learning actually do to model reasoning?" and "Does RLVR actually expand what models can reason about?"). If that holds across reward types, then the belief-shift-vs-process debate is less about who learns more and more about who extracts existing capability most cheaply, which is exactly where belief-shift's no-extra-models design looks strongest.
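The "high sample counts" comparison rests on pass@k, which can be made concrete with the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below is illustrative only: the per-problem counts are invented toy inputs, not results from any paper, and the names `pass_at_k`, `mean_pass_at_k`, `rl_counts`, and `base_counts` are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(counts, k):
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)

# Invented toy counts (n, c) for two problems. The "RL" model is sharp on
# problem 1 and never solves problem 2; the "base" model is weaker on
# problem 1 but occasionally solves problem 2.
rl_counts = [(100, 80), (100, 0)]
base_counts = [(100, 20), (100, 2)]

for k in (1, 64):
    rl = mean_pass_at_k(rl_counts, k)
    base = mean_pass_at_k(base_counts, k)
    print(f"pass@{k}: rl={rl:.3f} base={base:.3f}")
```

With these toy counts the sharpened model leads at k=1 and the broader one leads at k=64. That crossover is the pattern the pass@k analyses describe: a model can gain sampling efficiency without solving any problem it could not already reach.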

Sources (8 notes)

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Reward Reasoning Model (4.20 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (2.63 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (2.58 match)
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? (2.48 match)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (1.78 match)
Intrinsic Credit Assignment for Long Horizon Interaction (1.77 match)
Spurious Rewards: Rethinking Training Signals in RLVR (1.76 match)
RM-R1: Reward Modeling as Reasoning (1.72 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: How do belief-shift reward, process rewards, and curiosity-driven approaches compare as learning signals for LLM reasoning?

What a curated library found — and when (dated claims, not current truth):
Findings span April 2025–February 2026. A library of 12 papers on verifier-free RL and reasoning rewards reports:

• Belief-shift reward (internal probability movement as dense intrinsic signal) eliminates the critic network; process rewards target step-level scoring; neither requires external reward models (~2025).
• Tree-structured rollouts convert outcome rewards into process supervision for free by comparing sibling branches; generative stepwise judges (reasoning *about* reasoning) beat classifier-style scorers on 10–100× less training data (~2025).
• Agent feedback decomposes into two orthogonal channels—evaluative ('how good?') and directive ('what should change?')—and scalar rewards capture only the first; natural-language critiques break RL plateaus (~2025).
• Verifiable-reward RL sharpens sampling toward solutions the base model could already reach rather than expanding reasoning capability; spurious rewards work nearly as well as correct ones (~2025).
• Curiosity-driven reward is absent from this library; the comparison leg on intrinsic curiosity is thin.

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (April 2025): Does RL really incentivize reasoning beyond base model?
• arXiv:2508.19229 (August 2025): StepWiser—generative judges for reasoning.
• arXiv:2509.25760 (September 2025): TruthRL via RL incentives.
• arXiv:2602.12342 (February 2026): Intrinsic credit assignment for long-horizon interaction.

Your task:

(1) RE-TEST EACH CONSTRAINT. For belief-shift vs. process rewards: do newer model scales, training regimes (e.g., constitutional AI, chain-of-thought refinement), or multi-agent orchestration (memory, tree-search caching) now show capability *expansion* rather than sampling sharpening? Has curiosity-driven exploration (e.g., uncertainty quantification, information gain) appeared in any reasoning-RL work since Feb 2026? Separate the durable question (signal design for reasoning) from the perishable claim (RL doesn't expand frontier).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any paper show RL-trained models *do* exceed base-model reasoning ceilings? Does intrinsic motivation (curiosity) outperform belief-shift or process rewards?

(3) Propose 2 research questions that assume the regime may have moved:
   – If RL plateaus on capability remain real, can belief-shift reward + natural-language feedback channels (evaluative + directive decomposed) together break the ceiling?
   – Does curiosity-driven RL on *exploratory* reasoning (open-ended problem-solving, not verification) escape the frontier-expansion penalty?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
