Why do spurious rewards work nearly as well as correct ones?
This explores why reinforcement learning with verifiable rewards (RLVR) gets reasoning gains even when the reward signal is random or wrong — and what that reveals about where the reasoning actually lives.
The short version from the corpus: the reward isn't teaching the model to reason. It's switching on reasoning the model already had. 'Why does RLVR work with completely random rewards?' frames this as a phase transition in the model's output distribution: the reward acts as a catalyst that shifts the model into a reasoning-heavy mode, and the quality of the signal matters far less than the quality of pretraining. You're not installing a skill; you're flipping a switch on a skill that was latent.
The catch is that this only works for some models, which is the most revealing part. 'Why do random rewards improve reasoning for some models but not others?' shows Qwen2.5-Math jumping 16–25% on MATH-500 from random or incorrect rewards, because its pretraining baked in a latent code-reasoning behavior that optimization pressure can surface, while Llama and OLMo, lacking that pretraining format, show no gains. So 'spurious rewards work' isn't a universal law of RL; where it holds, it's evidence that the reasoning was sitting in the pretrained weights all along, waiting for any optimization pressure to elicit it. The reward picks the lock; pretraining decided whether there was anything behind the door.
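To make 'spurious' concrete, here is a minimal sketch of the three kinds of signal being compared. The function names, the string-match check, and the coin-flip probability `p=0.5` are all illustrative assumptions, not the setup used in the cited notes.

```python
import random

def correct_reward(answer: str, truth: str) -> float:
    """Verifiable reward: 1 when the answer matches the reference."""
    return 1.0 if answer == truth else 0.0

def incorrect_reward(answer: str, truth: str) -> float:
    """Deliberately wrong signal: pays out only for wrong answers."""
    return 1.0 - correct_reward(answer, truth)

def random_reward(answer: str, truth: str, rng: random.Random, p: float = 0.5) -> float:
    """Coin-flip signal: ignores both the answer and the reference (p is an assumed value)."""
    return 1.0 if rng.random() < p else 0.0

# The random signal pays right and wrong answers at the same average rate,
# so it carries no information about correctness.
rng = random.Random(0)
right = [random_reward("4", "4", rng) for _ in range(10_000)]
wrong = [random_reward("5", "4", rng) for _ in range(10_000)]
mean_right = sum(right) / len(right)
mean_wrong = sum(wrong) / len(wrong)
```

Neither spurious signal tells the optimizer which answers are right, so any gain they produce has to come from somewhere other than the reward's content.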
This reframes what the reward signal contributes. If almost any signal flips the switch, the interesting question becomes what a *good* signal adds beyond the flip. The corpus suggests the answer is precision and safety, not activation. 'Does negative reinforcement alone outperform full reinforcement learning?' finds that training on only negative samples (just suppressing wrong trajectories) often matches full PPO and GRPO while preserving answer diversity, hinting that much of RL's value is in pruning rather than rewarding. And 'Can scalar rewards capture all the information in agent feedback?' points out that a scalar reward carries 'how well did this do' but throws away 'how should it change', so a cruder signal loses directional richness, not the basic catalytic push.
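The pruning-versus-rewarding contrast can be seen in a toy sketch. It assumes a softmax policy over three candidate answers and a plain single-sample REINFORCE update. The setup, the names `reinforce_step`, `sampled`, and `weight`, and the starting logits are hypothetical, and this is not the training recipe from the cited note.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, sampled, weight, lr=1.0):
    """One REINFORCE step on a single sample: z += lr * weight * (onehot(sampled) - p)."""
    p = softmax(logits)
    return [
        z + lr * weight * ((1.0 if j == sampled else 0.0) - pj)
        for j, (z, pj) in enumerate(zip(logits, p))
    ]

# Hypothetical setup: answers 0 and 1 are both correct, answer 2 is wrong.
logits = [1.0, 0.0, 1.0]
before = softmax(logits)

# Negative-only: the wrong answer was sampled and is pushed down (weight -1).
after_negative = softmax(reinforce_step(logits, sampled=2, weight=-1.0))

# Positive-only: correct answer 0 was sampled and is pushed up (weight +1).
after_positive = softmax(reinforce_step(logits, sampled=0, weight=+1.0))
```

In this toy case the negative-only step lowers the wrong answer and raises both correct ones, while the positive-only step raises the sampled correct answer at the expense of the other correct one. That is the diversity effect in miniature. It is a per-sample effect: in exact expectation the two weightings differ only by a constant baseline.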
The danger lurking under 'spurious rewards are fine' is that proxy signals which correlate with correctness *at first* can quietly stop doing so. 'Does self-consistency reliably reward correct answers during training?' shows self-consistency rewards bootstrapping nicely and then teaching the model to produce confidently wrong but reproducible answers: a failure that looks like improvement. 'Does binary reward training hurt model calibration?' makes the related point that even *correct* binary rewards degrade calibration by rewarding confident guessing. So 'the reward barely matters' is true for triggering reasoning and false for shaping its trustworthiness, which is exactly where richer designs like ternary truth/abstention rewards ('Can three-way rewards fix the accuracy versus abstention problem?') and reasoning-before-scoring judges ('Can reward models benefit from reasoning before scoring?') earn their keep.
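The reward designs in this paragraph are small enough to write down. What follows is a sketch under stated assumptions, not the cited papers' code: the Brier term is combined as correctness minus squared confidence error, the abstention value defaults to 0.0 (the notes only say 'intermediate'), and all function and parameter names are hypothetical.

```python
from collections import Counter

def binary_reward(correct: bool) -> float:
    """Plain correctness reward: blind to how confident the model was."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed form: y - (q - y)^2)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def ternary_reward(outcome: str, abstain_value: float = 0.0) -> float:
    """Three-way reward; abstain_value is an assumed placeholder between -1 and +1."""
    table = {"correct": 1.0, "abstain": abstain_value, "hallucination": -1.0}
    return table[outcome]

def self_consistency_rewards(answers: list[str]) -> list[float]:
    """Label-free proxy: reward agreement with the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Failure mode of the proxy: four samples agree on "7" when the reference is "8".
# Every sample is still paid in full, because no reference is consulted.
sc = self_consistency_rewards(["7", "7", "7", "7"])

# Confident guessing: binary reward treats all wrong answers alike,
# while the Brier term charges more for a confident miss.
confident_wrong = brier_augmented_reward(False, 0.9)
hedged_wrong = brier_augmented_reward(False, 0.5)
```

The binary reward gives both wrong answers the same 0, so it exerts no pressure against overconfidence. The Brier-augmented version ranks the hedged miss above the confident one, and the ternary version gives abstention a place to land between right and wrong.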
The thing you didn't know you wanted to know: the surprising headline 'random rewards work' is really a backhanded measurement of pretraining. RLVR is less a teacher than a developer fluid — it makes visible what the base model already contains. Which means if spurious rewards *don't* help your model, that's not a tuning failure; it's the model telling you the reasoning was never latent there to begin with.
Sources (8 notes)
RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.
Qwen2.5-Math gains 16-25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward's correlation with correctness degrades over training, creating a failure mode that looks like improvement.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.