Are different reward signal sources substitutable in verifier-free RL?
This explores whether the different sources of reward in verifier-free RL — a model judging itself, its own shifting beliefs, its confidence, even random noise — are interchangeable, or whether each does something distinct. The corpus suggests a surprising answer in two layers: at the architectural level, several reward sources really are substitutable, but at the level of what they teach, the substitution only works because the reward isn't doing the heavy lifting you'd assume.
The clearest case for substitutability comes from a late-2025 convergence in which three independent reward sources each replace a different RLHF component (see "Can language models replace reward models with internal signals?"). A model judging its own answers pairwise can stand in for the reward model; its belief-shift toward a solution can stand in for the critic; rich self-feedback can replace the explicit reward signal entirely. You can see the individual pieces working on their own: an agent's log-ratio of belief in the target answer gives dense per-turn credit with no critic network at all (see "Can an agent's own beliefs guide credit assignment without critics?"), and a model's confidence in its own answer span can rank reasoning traces without any external verifier, while also undoing the calibration damage that RLHF usually causes (see "Can model confidence work as a reward signal for reasoning?"). Different signals, same job.
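The belief-shift signal described above can be sketched in a few lines. This is a minimal illustration, assuming the agent can report a probability for the target answer after each turn; the function name `belief_shift_rewards`, the `eps` floor, and the example numbers are hypothetical and not taken from the cited work.

```python
import math
from typing import List


def belief_shift_rewards(beliefs: List[float], eps: float = 1e-8) -> List[float]:
    """Per-turn reward as the log-ratio of consecutive beliefs in the target answer.

    beliefs[0] is the (assumed) prior probability of the target answer before
    any turn; beliefs[t] is the probability after turn t.
    """
    rewards = []
    for prev, curr in zip(beliefs, beliefs[1:]):
        # Clamp both values to avoid log(0); eps is an illustrative choice.
        prev = max(prev, eps)
        curr = max(curr, eps)
        rewards.append(math.log(curr / prev))
    return rewards


# Hypothetical trajectory: belief in the target rises from 0.1 to 0.8 over three turns.
rewards = belief_shift_rewards([0.1, 0.2, 0.2, 0.8])
# A turn that leaves the belief unchanged earns zero credit, and the per-turn
# rewards telescope to log(final belief / prior belief).
total = sum(rewards)
```

Because the rewards telescope, the dense per-turn signal adds up to the overall gain in belief, which is one way to see why no separate critic is needed to spread credit across turns in this sketch.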
But the deeper reason these sources are swappable is unsettling: in a lot of RLVR, the reward barely matters. Spurious rewards with zero correlation to correct answers still improve reasoning, but only for models like Qwen2.5-Math whose pretraining already instilled the relevant latent skill, and not at all for Llama or OLMo (see "Why do random rewards improve reasoning for some models but not others?"). The reward is acting as a catalyst that surfaces pretrained behavior, not a teacher building new capability (see "What does reward learning actually do to model reasoning?" and "How does RL training reshape reasoning and what gets lost?"). That's also why RLVR doesn't push reasoning past the base model's boundaries: it narrows sampling toward solutions already in the distribution rather than expanding what's solvable (see "Does RLVR actually expand what models can reason about?"), and the updates it makes touch only a structured 5–30% of parameters (see "Does reinforcement learning update only a small fraction of parameters?"). If the signal is mostly flipping a switch that's already wired, of course many switches do the trick.
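The narrowing claim is usually checked with Pass@k. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two "policies" and their per-problem correct counts are made-up numbers chosen only to illustrate the pattern; they are not results from the cited work.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average Pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 20
# Hypothetical "narrowed" policy: always solves two problems, never the other two.
narrow = [20, 20, 0, 0]
# Hypothetical "broad" policy: solves every problem occasionally.
broad = [6, 6, 6, 6]

narrow_p1, broad_p1 = mean_pass_at_k(narrow, N, 1), mean_pass_at_k(broad, N, 1)
narrow_p10, broad_p10 = mean_pass_at_k(narrow, N, 10), mean_pass_at_k(broad, N, 10)
```

In this toy setup the narrowed policy wins at k=1 while the broad policy wins at k=10, which is the shape of result the Pass@k argument relies on.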
Substitutability breaks down on *what the reward penalizes*, not on where it comes from. Here the shape of the signal is decisive and not interchangeable at all. Binary correctness rewards provably wreck calibration because a wrong answer scores the same whether it was stated confidently or hesitantly, and the fix is a specific extra term, the Brier score, not just any second signal (see "Does binary reward training hurt model calibration?"). A ternary reward that separates correct answers, hallucinations, and abstentions makes "I don't know" learnable in a way binary rewards structurally can't (see "Can three-way rewards fix the accuracy versus abstention problem?"). Negative-only reinforcement preserves answer diversity and Pass@k, while positive-only reinforcement collapses it (see "Does negative reinforcement alone outperform full reinforcement learning?"). And decomposing a fuzzy goal into a verifiable checklist beats a single holistic score (see "Can breaking down instructions into checklists improve AI reward signals?").
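The difference between these reward shapes is easy to see side by side. The following is a rough sketch, not any paper's exact formulation: it assumes the Brier-augmented reward is correctness minus the squared gap between stated confidence and correctness, and it assumes an abstention value of 0.0 as the "intermediate" reward. All function names are hypothetical.

```python
from typing import Optional


def binary_reward(correct: bool) -> float:
    """1 for a correct answer, 0 otherwise; stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed form).

    confidence is the model's stated probability, in [0, 1], that its answer
    is correct.
    """
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def ternary_reward(correct: Optional[bool], abstain_value: float = 0.0) -> float:
    """+1 correct, -1 wrong, abstain_value for abstention (correct=None).

    abstain_value=0.0 is an assumed intermediate value.
    """
    if correct is None:
        return abstain_value
    return 1.0 if correct else -1.0


# A wrong answer gets the same binary reward at any confidence...
same = binary_reward(False)
# ...but the Brier term penalizes the confident wrong answer more heavily.
confident_wrong = brier_augmented_reward(False, 0.9)
hesitant_wrong = brier_augmented_reward(False, 0.1)
```

Under the ternary shape, abstaining scores strictly between a correct answer and a wrong one, so "I don't know" can beat guessing whenever the model is likely to be wrong.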
So the takeaway you might not have expected: the *origin* of the reward (self-judge vs. belief-shift vs. confidence vs. noise) is largely fungible, because in capability-activation regimes the reward is a trigger rather than a teacher — but the *structure* of the reward (binary vs. ternary, positive vs. negative, holistic vs. decomposed) is not fungible at all, because that's what actually decides which behaviors survive training. Verifier-free RL frees you from needing an external grader; it does not free you from designing what the grade means.
Sources (12 notes)
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Qwen2.5-Math improves by 16–25% on MATH-500 when trained with random or incorrect rewards, because these activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.