INQUIRING LINE

If AI training still works with random rewards, a signal that's right 93% of the time is more than enough.

How does 93% reward reliability compare to other RL noise sources?

This explores reward-signal noise in reinforcement learning — reading '93% reliability' as a reward that's correct ~93% of the time — and asks whether a 7% error rate is large or small next to the other things that perturb RL training.

The corpus's most surprising answer is that, for a certain class of models, reward noise barely registers at all. RLVR can improve reasoning even when the reward signal is *random or actively wrong*, because it does not teach new skills; it catalyzes reasoning behavior already latent from pretraining (note: Why does RLVR work with completely random rewards?). By that logic, a 93%-reliable reward sits comfortably inside the tolerance band: such a model would still gain from a reward that is right only 50% of the time.
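To put a number on what 93% means as a raw signal, here is a minimal sketch that assumes a 0/1 reward flipped symmetrically and independently with probability 7%. That noise model, and the function names, are illustrative assumptions and not something the notes specify. Under it, the expected reward gap between a correct and an incorrect answer shrinks from 1.0 to 2 × 0.93 − 1 = 0.86, and it vanishes at 50% reliability. So if a model still improves under random rewards, the gain cannot be coming from the reward's information content, which fits the latent-behavior explanation above.

```python
import random


def noisy_reward(true_reward: int, reliability: float, rng: random.Random) -> int:
    """Return the true 0/1 reward with probability `reliability`, else its flip."""
    return true_reward if rng.random() < reliability else 1 - true_reward


def expected_gap(reliability: float) -> float:
    """Analytic gap E[r | correct] - E[r | wrong] under symmetric flips.

    E[r | correct] = reliability and E[r | wrong] = 1 - reliability,
    so the gap is 2 * reliability - 1.
    """
    return 2.0 * reliability - 1.0


def simulated_gap(reliability: float, n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same gap, as a sanity check."""
    rng = random.Random(seed)
    mean_correct = sum(noisy_reward(1, reliability, rng) for _ in range(n)) / n
    mean_wrong = sum(noisy_reward(0, reliability, rng) for _ in range(n)) / n
    return mean_correct - mean_wrong


if __name__ == "__main__":
    for rel in (1.0, 0.93, 0.75, 0.5):
        print(f"reliability={rel:.2f}  analytic gap={expected_gap(rel):.2f}  "
              f"simulated gap={simulated_gap(rel):.3f}")
```

Real reward errors are rarely symmetric or independent of the answer, which is exactly why the sections that follow treat the shape of the error as more important than its rate.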

But the corpus immediately complicates this: the percentage may be the wrong axis to measure on. Whether noise matters depends on the *model*, not the noise rate: Qwen2.5-Math gains 16–25% on MATH-500 from spurious rewards by surfacing latent code-reasoning, while Llama and OLMo gain nothing from the same signal (note: Why do random rewards improve reasoning for some models but not others?). And the robustness can be an artifact: on contaminated benchmarks random rewards 'work' through memorization, but on clean held-out tests only genuinely correct rewards help, while random and inverse rewards fail or degrade performance (note: Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So the same 7% noise can be invisible in one setting and fatal in another.

The deeper move is that the *structure* of a reward error matters far more than its frequency. A reward can be 100% reliable on accuracy and still systematically corrupt the model: binary correctness rewards never penalize confident wrong answers, so they reliably degrade calibration regardless of how often they're 'right' (note: Does binary reward training hurt model calibration?). The fix isn't a cleaner signal but a differently shaped one: a Brier-score term, or a three-way reward that makes abstention learnable instead of forcing a guess (note: Can three-way rewards fix the accuracy versus abstention problem?). Direction matters too: training on *only* negative signals (suppressing wrong trajectories) can match or beat full RL, because positive-only reinforcement collapses diversity (note: Does negative reinforcement alone outperform full reinforcement learning?). By this reasoning, a 93%-reliable reward whose 7% errors are confidently-wrong positives is worse than one whose errors are missed negatives.
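The point about reward shape can be checked with a few lines of arithmetic. The sketch below is illustrative only: the combined form `y - (confidence - y)**2` is one assumed way of adding a Brier term, and the abstention reward of 0.0 is a hypothetical stand-in for the 'intermediate' value the TruthRL note describes. It shows that a purely binary reward is indifferent to stated confidence, that the Brier-augmented reward is maximized in expectation by reporting the true probability of being correct, and that a +1 / -1 / abstain scheme makes abstention the better choice once that probability drops below one half.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Plain correctness reward; stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus squared calibration error (assumed illustrative form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, grid: int = 101) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(reward_fn, p_correct, q))


def three_way_expected(p_correct: float, abstain_reward: float = 0.0):
    """Expected reward of answering versus abstaining under +1 / -1 / abstain."""
    answer = p_correct * 1.0 + (1.0 - p_correct) * -1.0
    return answer, abstain_reward


if __name__ == "__main__":
    p = 0.6
    print("binary, confidence 0.10:", expected_reward(binary_reward, p, 0.10))
    print("binary, confidence 0.99:", expected_reward(binary_reward, p, 0.99))
    print("Brier-augmented optimum:", best_confidence(brier_augmented_reward, p))
    print("three-way at p=0.3 (answer, abstain):", three_way_expected(0.3))
```

None of these effects depends on how often the correctness label itself is right, which is the sense in which shape dominates frequency.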

Now set that against the other noise sources RL actually contends with, and the reward channel looks almost quiet. Sampling itself is noisy in ways determinism hides: zero temperature gives you the *same* draw repeatedly, not a *reliable* one, and consistency across 100 repetitions still leaves you holding one sample from the distribution (note: Does setting temperature to zero actually make LLM outputs reliable?). Cross-rollout variance is large enough that it can be repurposed as a training signal in its own right, weighting tokens and filtering degenerate queries (note: Can one statistical measure serve dual purposes in RL training?). Meanwhile the update itself is strikingly *stable*: across seven algorithms and ten model families, RL touches only 5–30% of parameters, and the set of parameters it touches is nearly identical across random seeds (note: Does reinforcement learning update only a small fraction of parameters?).
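As a rough illustration of how cross-rollout variance can double as a filter, the sketch below uses group-relative reward normalization in the style of GRPO. This is an assumption chosen for simplicity: it is not the statistic DRO uses, and the function names and the toy batch are hypothetical. A query whose rollouts all receive the same reward has zero variance, yields no relative signal, and is dropped; a query with mixed outcomes is kept and its rewards are centered and scaled.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Center and scale the rewards of one query's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_degenerate(rewards: list[float], tol: float = 1e-12) -> bool:
    """A group with (near) zero reward spread carries no relative signal."""
    return pstdev(rewards) <= tol


def filter_queries(batch: dict[str, list[float]]) -> dict[str, list[float]]:
    """Keep only informative queries and return their normalized advantages."""
    return {
        query: group_advantages(rewards)
        for query, rewards in batch.items()
        if not is_degenerate(rewards)
    }


if __name__ == "__main__":
    # Hypothetical rollout rewards for three queries, four rollouts each.
    batch = {
        "always_solved": [1.0, 1.0, 1.0, 1.0],
        "mixed": [1.0, 0.0, 0.0, 1.0],
        "never_solved": [0.0, 0.0, 0.0, 0.0],
    }
    for query, advantages in filter_queries(batch).items():
        print(query, [round(a, 3) for a in advantages])
```

In this framing, rollout-to-rollout spread is not only noise to be averaged away; it also indicates which queries are worth a gradient step at all.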

The thing you didn't know you wanted to know: a 93% reliability figure measures along the wrong dimension. RL's robustness to reward noise comes from RLVR sharpening an existing distribution rather than expanding it (note: Does RLVR actually expand what models can reason about?), which is also why some newer methods drop the trained reward signal entirely, replacing it with the policy's own self-judgment (note: Can language models replace reward models with internal signals?). If you can throw the reward model away and still train, then a 7% error rate in that reward model was never the bottleneck. The bottleneck is whether the error is shaped to push the model toward overconfidence, toward collapsed diversity, or toward memorization. A clever design uses rewards as gates rather than dense scores precisely to keep noise from being hackable (note: Can rubrics and dense rewards work together without hacking?).
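The difference between sharpening and expanding shows up in Pass@k arithmetic. The sketch below uses made-up per-problem success probabilities (assumptions, not figures from any of the notes) and the identity pass@k = 1 − (1 − p)^k for k independent samples. The hypothetical sharpened model concentrates mass on the problem it could already solve and loses the rare solution to the other, so it wins at k = 1 and loses at large k, which is the pattern the Pass@k note reports for RLVR models against their base models.

```python
def pass_at_k(p_success: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_success) ** k


def mean_pass_at_k(per_problem_p: list[float], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in per_problem_p) / len(per_problem_p)


# Hypothetical per-sample success rates on two problems (illustrative only).
base = [0.30, 0.05]       # solves both problems occasionally
sharpened = [0.90, 0.00]  # mass concentrated on the first; the second is lost

if __name__ == "__main__":
    for k in (1, 8, 64):
        print(f"k={k:>2}  base={mean_pass_at_k(base, k):.3f}  "
              f"sharpened={mean_pass_at_k(sharpened, k):.3f}")
```

A reward signal only has to be good enough to pick out behavior the model already produces, which is a much lower bar than teaching something new.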

Sources (12 notes)

Why does RLVR work with completely random rewards?

RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Spurious Rewards: Rethinking Training Signals in RLVR (4.32 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (4.21 match)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (4.15 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (3.40 match)
Reward Reasoning Model (3.28 match)
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination (2.50 match)
Escaping the Verifier: Learning to Reason via Demonstrations (2.48 match)
Reinforcement Learning with Rubric Anchors (2.48 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL-in-LLMs analyst. The question: Does a 93%-reliable reward signal pose a meaningful bottleneck in RL training, or is robustness to reward noise higher than the percentage suggests?

What a curated library found — and when (dated claims, not current truth):
Findings span Sept 2024–Oct 2025. A synthesis library reports:
- RLVR catalyzes latent reasoning from pretraining rather than teaching new skills; it tolerates reward signals down to 50% reliability, making 93% comfortably safe (2025).
- Robustness is model-dependent, not noise-rate-dependent: Qwen2.5-Math gains 16–25% from spurious rewards; Llama and OLMo gain nothing from identical signals (2025).
- 93% reliability on *contaminated benchmarks* may be noise-invisible due to memorization; on held-out tests, only correct rewards help; random/inverse rewards degrade performance (2025).
- Reward *structure* (shape, directionality) matters more than frequency: binary correctness rewards systematically degrade calibration regardless of reliability; negative-only RL can match full RL because it avoids the diversity collapse caused by positive-only reinforcement (2025).
- RL updates touch only 5–30% of parameters in sparse, full-rank subnetworks; update stability is high across algorithms and model families (2025).

Anchor papers (verify; mind their dates):
- arXiv:2504.13837 (Apr 2025): Does RL expand reasoning beyond base-model capability?
- arXiv:2506.01347 (Jun 2025): Token-level rewards + rubric gates vs. dense scoring.
- arXiv:2507.14843 (Jul 2025): RLVR as latent-capacity amplifier, not capability expander.
- arXiv:2510.18176 (Oct 2025): Memorization vs. valid reasoning in math domains.

Your task:
(1) RE-TEST the 93% threshold. For each constraint above—latent-skill catalysis, model-dependency, structure-over-frequency, memorization risk—judge whether scaling, newer reward designs (e.g., outcome supervision, process rewards, self-play), training harnesses, or evals since Oct 2025 have relaxed or inverted it. Separate durable questions (e.g., Does reward shape matter?) from perishable limits (e.g., Is 93% sufficient?). Cite what changed the boundary.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months. Does any recent result claim robust RL requires higher signal fidelity, or that the 7% error regime is model-agnostic after all?
(3) Propose 2 research questions that *assume* the regime has moved: e.g., If structure > frequency, what is the optimal reward shape for a given model family? If memorization is the risk, does test-time verification or online filtering remove it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
