How does 93% reward reliability compare to other RL noise sources?
This explores reward-signal noise in reinforcement learning — reading '93% reliability' as a reward that's correct ~93% of the time — and asks whether a 7% error rate is large or small next to the other things that perturb RL training.
The corpus's most surprising answer is that, for a certain class of models, reward noise barely registers at all. RLVR can improve reasoning even when the reward signal is *random or actively wrong*, because it doesn't teach new skills; it catalyzes reasoning behavior already latent from pretraining (see "Why does RLVR work with completely random rewards?"). By that logic, a 93%-reliable reward sits comfortably inside the tolerance band: such a model would still gain at 50%.
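To put a rough number on the 7%, here is a minimal sketch under an assumption the corpus does not state: a binary 0/1 reward that is flipped independently and symmetrically with probability `flip_rate`. Under that assumption, the expected reward gap between a correct and an incorrect answer is `1 - 2 * flip_rate`. The function names are illustrative.

```python
import random


def expected_gap(flip_rate: float) -> float:
    """Expected reward gap between correct and incorrect answers.

    Assumes a 0/1 reward flipped independently with probability
    `flip_rate` (symmetric noise; an illustrative assumption):
    E[r | correct] = 1 - flip_rate and E[r | incorrect] = flip_rate.
    """
    return (1.0 - flip_rate) - flip_rate


def simulated_gap(flip_rate: float, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo check of expected_gap using simulated noisy rewards."""
    rng = random.Random(seed)

    def noisy(true_reward: int) -> int:
        # Flip the true reward with probability flip_rate.
        return 1 - true_reward if rng.random() < flip_rate else true_reward

    mean_correct = sum(noisy(1) for _ in range(n)) / n
    mean_incorrect = sum(noisy(0) for _ in range(n)) / n
    return mean_correct - mean_incorrect


if __name__ == "__main__":
    for eps in (0.0, 0.07, 0.25, 0.5):
        print(f"flip={eps:.2f}  analytic={expected_gap(eps):.3f}  "
              f"simulated={simulated_gap(eps):.3f}")
```

At a 7% flip rate the gap shrinks from 1.0 to about 0.86, a modest attenuation. At 50% it is zero, so any gain under random rewards cannot come from the reward's correlation with correctness, which fits the catalysis reading above.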
But the corpus immediately complicates this: the percentage is the wrong axis to measure on. Whether noise matters depends on the *model*, not the noise rate. Qwen2.5-Math gains 16–25% on MATH-500 from spurious rewards by surfacing latent code-reasoning, while Llama and OLMo gain nothing from the same signal (see "Why do random rewards improve reasoning for some models but not others?"). And the robustness can be an artifact: on contaminated benchmarks random rewards 'work' through memorization, but on clean held-out tests only genuinely correct rewards help, while random and inverse rewards fail or degrade performance (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). So the same 7% noise can be invisible in one setting and fatal in another.
The deeper move is that the *structure* of a reward error matters far more than its frequency. A reward can be 100% reliable on accuracy and still systematically corrupt the model: binary correctness rewards never penalize confident wrong answers, so they encourage high-confidence guessing and degrade calibration regardless of how often they're 'right' (see "Does binary reward training hurt model calibration?"). The fix isn't a cleaner signal but a differently shaped one: a Brier-score term, or a three-way reward that makes abstention learnable instead of forcing a guess (see "Can three-way rewards fix the accuracy versus abstention problem?"). Direction matters too: training on *only* negative signals (suppressing wrong trajectories) can match or beat full RL, because positive-only reinforcement collapses diversity (see "Does negative reinforcement alone outperform full reinforcement learning?"). A 93%-reliable reward whose 7% of errors are confidently wrong positives is worse than one whose errors are missed negatives.
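The three reward shapes discussed above can be compared in a few lines. This is a sketch, not the cited methods' exact formulas. The Brier-augmented form (correctness minus the squared error of the stated confidence) and the abstention value of 0.0 are assumptions chosen for illustration; the sources say only that a Brier term is added and that abstention receives an intermediate reward.

```python
def binary_reward(correct: bool) -> float:
    """1 for a correct answer, 0 otherwise; confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the squared error of the stated confidence.

    Assumed form: a confidently wrong answer is penalized more than a
    hesitantly wrong one, which the binary reward cannot express.
    """
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def ternary_reward(outcome: str, abstain_value: float = 0.0) -> float:
    """+1 correct, -1 wrong, and an intermediate value for abstaining.

    abstain_value=0.0 is an assumption; any value in (-1, 1) is intermediate.
    """
    return {"correct": 1.0, "wrong": -1.0, "abstain": abstain_value}[outcome]


def expected_guess_value(p_correct: float, scheme: str) -> float:
    """Expected reward for guessing when the answer is right with p_correct."""
    if scheme == "binary":
        return p_correct * binary_reward(True) + (1 - p_correct) * binary_reward(False)
    if scheme == "ternary":
        return p_correct * ternary_reward("correct") + (1 - p_correct) * ternary_reward("wrong")
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    # Binary: both wrong answers score the same, whatever the confidence.
    print(binary_reward(False), binary_reward(False))
    # Brier-augmented: the confident wrong answer scores lower.
    print(brier_augmented_reward(False, 0.9), brier_augmented_reward(False, 0.1))
    # With a 30% chance of being right, guessing beats abstaining only under binary.
    print(expected_guess_value(0.3, "binary"), expected_guess_value(0.3, "ternary"))
```

Under the binary scheme guessing never scores below zero, so abstaining is never worth learning. Under the three-way scheme with these assumed values, guessing has negative expected reward whenever the chance of being right is below 50%, so abstention becomes the better policy there.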
Now set that against the other noise sources RL actually contends with, and the reward channel looks almost quiet. Sampling itself is noisy in ways determinism hides: zero temperature gives you the *same* draw repeatedly, not a *reliable* one, and consistency across 100 repetitions still leaves you holding one sample from the distribution (see "Does setting temperature to zero actually make LLM outputs reliable?"). Cross-rollout variance is large enough that it can be repurposed as a training signal in its own right, weighting tokens and filtering degenerate queries (see "Can one statistical measure serve dual purposes in RL training?"). Meanwhile the update itself is strikingly *stable*: across seven algorithms and ten model families, RL touches only 5–30% of parameters, and which parameters it touches is nearly identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?").
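To make the cross-rollout idea concrete, here is a generic sketch of group-relative advantages with a zero-variance filter. It is not DRO's statistic, which the sources describe only at a high level. It shows the simpler, widely used pattern of normalizing rewards within a group of rollouts for the same query and dropping queries whose rollouts all received the same reward. All names and the toy rewards are illustrative.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one query's rollout group.

    Each rollout's advantage is its reward minus the group mean, divided
    by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_degenerate(groups: dict[str, list[float]],
                      min_std: float = 1e-6) -> dict[str, list[float]]:
    """Keep only queries whose rollouts disagree.

    If every rollout got the same reward, all advantages are zero and the
    query contributes no learning signal, so it is dropped.
    """
    return {q: rs for q, rs in groups.items() if pstdev(rs) > min_std}


if __name__ == "__main__":
    # Toy rewards for three hypothetical queries, four rollouts each.
    groups = {
        "q1": [1.0, 1.0, 1.0, 1.0],  # always solved: no signal
        "q2": [1.0, 0.0, 0.0, 1.0],  # mixed: informative
        "q3": [0.0, 0.0, 0.0, 0.0],  # never solved: no signal
    }
    kept = filter_degenerate(groups)
    for query, rewards in kept.items():
        print(query, [round(a, 3) for a in group_advantages(rewards)])
```

In this setup the variance across rollouts decides both which queries train the model and how large each rollout's update is, which is why it can swamp a few percent of reward error.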
The thing you didn't know you wanted to know: a 93% reliability figure compares on the wrong dimension. RL's robustness to reward noise comes from RLVR sharpening an existing distribution rather than expanding it (see "Does RLVR actually expand what models can reason about?"), which is also why some newer methods drop the trained reward model entirely, replacing it with the policy's own self-judgment (see "Can language models replace reward models with internal signals?"). If you can throw the reward model away and still train, then a 7% error rate in one was never the bottleneck. The bottleneck is whether the error is shaped to push the model toward overconfidence, toward collapsed diversity, or toward memorization. A careful design therefore uses rubrics as gates rather than as dense scores, precisely to keep the reward from being hackable (see "Can rubrics and dense rewards work together without hacking?").
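The sharpening claim is usually tested with pass@k, so a small sketch of that metric may help. The estimator below is the standard unbiased one computed from n samples of which c are correct. The per-problem counts are hypothetical numbers made up to show the crossover and are not taken from any cited study.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of them correct.

    Equals 1 minus the probability that k samples drawn without
    replacement from the n are all incorrect.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    n = 100
    # Hypothetical correct-sample counts on four problems.
    base = [2, 2, 2, 2]          # broad but unreliable
    sharpened = [90, 90, 0, 0]   # reliable, but on fewer problems
    for k in (1, 64):
        print(f"k={k:>2}  base={mean_pass_at_k(base, n, k):.3f}  "
              f"sharpened={mean_pass_at_k(sharpened, n, k):.3f}")
```

With these made-up counts the sharpened model wins at k=1 and loses at k=64, the pattern the sources describe when base models overtake RLVR models at high k.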
Sources (12 notes)
RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.
Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.