INQUIRING LINE


When AI trains itself, how it scores its own answers barely matters — which suggests the reward isn't really the teacher.

Are different reward signal sources substitutable in verifier-free RL?

This line of inquiry asks whether the different sources of reward in verifier-free RL (a model judging itself, its own shifting beliefs, its confidence, even random noise) are interchangeable, or whether each does something distinct. The corpus suggests a surprising answer in two layers: at the architectural level, several reward sources really are substitutable, but at the level of what they teach, the substitution works only because the reward isn't doing the heavy lifting you'd assume.

The clearest case for substitutability comes from a late-2025 convergence in which three independent reward sources each replace a different RLHF component (Can language models replace reward models with internal signals?). A model judging its own answers pairwise can stand in for the reward model; its belief-shift toward a solution can stand in for the critic; rich self-feedback can replace the explicit reward signal entirely. The individual pieces also work on their own: an agent's log-ratio of belief in the target answer gives dense per-turn credit with no critic network at all (Can an agent's own beliefs guide credit assignment without critics?), and a model's confidence in its own answer span can rank reasoning traces without any external verifier while also reversing the calibration damage that RLHF usually causes (Can model confidence work as a reward signal for reasoning?). Different signals, same job.
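The belief-shift signal described above can be illustrated with a minimal sketch. It assumes the policy exposes a probability for the target answer before the first turn and after each later turn; the function name, the `eps` clipping, and the example numbers are hypothetical and not taken from the ΔBelief-RL work itself.

```python
import math

def belief_shift_credits(beliefs, eps=1e-12):
    """Per-turn credit as log(p_t / p_{t-1}) for the target answer.

    `beliefs` is a hypothetical input: the policy's probability of the
    target answer before turn 1, then after each turn.
    """
    if len(beliefs) < 2:
        return []
    # Clip to avoid log(0); eps is an illustrative choice.
    clipped = [min(max(p, eps), 1.0) for p in beliefs]
    return [math.log(after / before)
            for before, after in zip(clipped, clipped[1:])]

# Made-up trajectory: belief rises, dips after an unhelpful turn, then jumps.
beliefs = [0.05, 0.10, 0.08, 0.40]
credits = belief_shift_credits(beliefs)
for turn, credit in enumerate(credits, start=1):
    print(f"turn {turn}: credit {credit:+.3f}")
```

Under this formulation the per-turn credits telescope: their sum equals the log-ratio between the final and initial belief, so a turn is rewarded only for the share of progress it contributed.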

But the deeper reason these sources are swappable is unsettling: in a lot of RLVR, the reward barely matters. Spurious rewards with zero correlation to correct answers still improve reasoning, but only for models like Qwen2.5-Math whose pretraining already instilled the relevant latent skill, and not at all for Llama or OLMo (Why do random rewards improve reasoning for some models but not others?). The reward acts as a catalyst that surfaces pretrained behavior, not a teacher building new capability (What does reward learning actually do to model reasoning?; How does RL training reshape reasoning and what gets lost?). That is also why RLVR doesn't push reasoning past the base model's boundaries: it narrows sampling toward solutions already in the distribution rather than expanding what's solvable (Does RLVR actually expand what models can reason about?), and the updates it makes touch only a structured 5–30% of parameters (Does reinforcement learning update only a small fraction of parameters?). If the signal is mostly flipping a switch that's already wired, of course many switches do the trick.
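The "narrowing" claim rests on Pass@k comparisons, so a small sketch of how Pass@k is typically computed may help. It uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples with c correct. The per-problem counts below are invented purely to show the crossover pattern; they are not results from any cited paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which yields 1.0 as expected.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average Pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts out of n = 100 samples on two problems.
n = 100
base_counts = [2, 2]   # base model: rarely right, but on both problems
rl_counts = [90, 0]    # tuned model: reliable on one, never right on the other

for k in (1, 64):
    print(k, mean_pass_at_k(base_counts, n, k), mean_pass_at_k(rl_counts, n, k))
```

In this toy setup the tuned model wins at k = 1 while the base model wins at k = 64, which is the shape of evidence the Pass@k note describes: sharper sampling, not a larger set of solvable problems.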

Where substitutability breaks down is on *what the reward penalizes*, not where it comes from. Here the shape of the signal is decisive and not interchangeable at all. Binary correctness rewards provably degrade calibration because a wrong answer scores the same whether it was confident or hesitant, and the fix is a specific extra term, the Brier score, not just any second signal (Does binary reward training hurt model calibration?). A ternary reward that separates correct answers, hallucinations, and abstentions makes "I don't know" learnable in a way binary rewards structurally can't (Can three-way rewards fix the accuracy versus abstention problem?). Negative-only reinforcement preserves answer diversity and Pass@k, while positive-only reinforcement collapses it (Does negative reinforcement alone outperform full reinforcement learning?). And decomposing a fuzzy goal into a verifiable checklist beats a single holistic score (Can breaking down instructions into checklists improve AI reward signals?).
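To make the structural differences concrete, here is a minimal sketch of three reward shapes. The exact combination used for the Brier-augmented reward (correctness minus squared calibration error) and the abstention value of 0.0 are assumptions chosen for illustration; the notes say only that a Brier term is added and that abstention receives an intermediate reward.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: blind to how confident the answer was."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed form, for illustration)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def ternary_reward(outcome: str, abstain_reward: float = 0.0) -> float:
    """Three-way reward; the abstention value is a hypothetical default."""
    table = {"correct": 1.0, "abstain": abstain_reward, "hallucination": -1.0}
    return table[outcome]

# A wrong answer at two confidence levels: binary cannot tell them apart.
for confidence in (0.9, 0.1):
    print(confidence,
          binary_reward(False),
          brier_augmented_reward(False, confidence))

# Under the ternary shape, abstaining beats guessing wrong.
print(ternary_reward("abstain") > ternary_reward("hallucination"))
```

The point of the sketch is that only the second and third shapes give the optimizer any gradient toward hedging or abstaining; swapping where the correctness label comes from would not change that.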

So the takeaway you might not have expected: the *origin* of the reward (self-judge vs. belief-shift vs. confidence vs. noise) is largely fungible, because in capability-activation regimes the reward is a trigger rather than a teacher — but the *structure* of the reward (binary vs. ternary, positive vs. negative, holistic vs. decomposed) is not fungible at all, because that's what actually decides which behaviors survive training. Verifier-free RL frees you from needing an external grader; it does not free you from designing what the grade means.

Sources (12 notes)

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves by 16–25% on MATH-500 when trained with random or incorrect rewards, because training activates latent code-reasoning behavior from pretraining; Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
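The decomposition idea in the last note can be sketched as a weighted checklist score. This is a generic aggregation under stated assumptions, not the scoring rule of RLCF or RaR: the function name, the criteria, and the weights are all hypothetical.

```python
def checklist_reward(checks, weights=None):
    """Weighted fraction of satisfied checklist items, in [0, 1].

    `checks` maps a criterion name to whether it was verified as met.
    `weights` optionally maps the same names to non-negative weights.
    """
    if not checks:
        raise ValueError("checklist must contain at least one item")
    if weights is None:
        weights = {name: 1.0 for name in checks}
    total = sum(weights[name] for name in checks)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    met = sum(weights[name] for name, ok in checks.items() if ok)
    return met / total

# Hypothetical criteria for a single instruction-following response.
checks = {
    "answers_in_french": True,
    "under_100_words": True,
    "cites_a_source": False,
}
uniform = checklist_reward(checks)
weighted = checklist_reward(
    checks,
    {"answers_in_french": 1.0, "under_100_words": 1.0, "cites_a_source": 2.0},
)
print(uniform, weighted)
```

Because each item is checked separately, a response cannot raise its score through surface polish alone; it has to satisfy another criterion.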

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (5.16 match)
Spurious Rewards: Rethinking Training Signals in RLVR (4.36 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (4.34 match)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (4.20 match)
Reward Reasoning Model (3.34 match)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2.64 match)
Learning to Reason without External Rewards (2.58 match)
Eliciting Reasoning in Language Models with Cognitive Tools (2.56 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a verifier-free RL researcher re-examining whether different reward signal sources remain substitutable given recent model and method advances. The question remains open: what makes some reward sources interchangeable and others not?

What a curated library found — and when (dated claims, not current truth):
Findings span September 2024–February 2026. A curated library identified three independently substitutable reward sources (self-judgment, belief-shift, confidence) that each replace an RLHF component (~2025). However, the deeper finding: reward substitutability works *only* because most reward signals act as capability catalysts surfacing pretraining, not teachers (~2025). Spurious rewards with zero correlation to correctness still improve reasoning in Qwen2.5-Math but not Llama/OLMo. RL updates touch only 5–30% of parameters in sparse subnetworks. Where substitutability *breaks*: reward *structure* (binary vs. ternary, positive vs. negative, holistic vs. decomposed) is non-fungible—binary rewards provably degrade calibration; ternary rewards make abstention learnable; negative-only RL preserves diversity while positive-only collapses it (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025-05): Does RL Really Incentivize Reasoning Beyond the Base Model
• arXiv:2505.11711 (2025-05): Sparse Subnetwork Updates in LLM RL
• arXiv:2506.01347 (2025-06): Negative Reinforcement Effectiveness
• arXiv:2507.18624 (2025-07): Checklists vs. Reward Models

Your task:
(1) RE-TEST THE CONTRADICTION: The library claims reward *origin* is fungible but *structure* is decisive. For each claimed substitutable source (self-judge, belief-shift, confidence, noise), test whether new model families (Gemini-2, o1-variants, multimodal), training methods (online RL, process reward models), or evals (mathematics, code, longer horizons) have since shown: (a) origin *does* matter in ways the library missed, or (b) structure constraints are *less* decisive than claimed. Flag where the catalyst-vs.-teacher distinction still holds.
(2) Surface the strongest *disagreement* from the last six months. Has any recent work claimed that reward sources are genuinely non-substitutable, or that structure is less constraining than the library suggests? Cite it.
(3) Propose two new research questions that assume the regime may have shifted: e.g., do foundation models trained with native online RL from the start relax the 5–30% parameter bound? Can adaptive reward structures (meta-learned or dynamically reweighted) overcome the binary-calibration failure?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
