INQUIRING LINE

Can random rewards improve reasoning models if pretraining is suitable?

This explores whether random or meaningless reward signals can still improve a model's reasoning — and the corpus says the answer hinges almost entirely on what the model already learned during pretraining.


The surprising answer from the corpus is yes, but only for certain models, and only because of what their pretraining already planted. The cleanest evidence comes from work on spurious rewards (see "Why do random rewards improve reasoning for some models but not others?"), where Qwen2.5-Math gained 16–25% on MATH-500 from rewards that were random or even deliberately wrong, while Llama and OLMo, pretrained differently, got nothing. The reward wasn't teaching anything. It was nudging the model to surface a latent code-reasoning habit baked in during pretraining. Pretraining format, not reward correctness, decided what the optimization pressure could surface.
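
To make the three reward regimes concrete, here is a minimal sketch in Python. The function names, the exact-match check, and the 0.5 coin-flip probability are assumptions made for illustration, not the setup used in the spurious-rewards work.

```python
import random


def correct_reward(answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 only when the answer matches the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def incorrect_reward(answer: str, reference: str) -> float:
    """Deliberately wrong reward (illustrative): pays only for non-matching answers."""
    return 1.0 - correct_reward(answer, reference)


def random_reward(answer: str, reference: str, rng: random.Random, p: float = 0.5) -> float:
    """Random reward: a coin flip that never looks at the answer."""
    return 1.0 if rng.random() < p else 0.0


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for answer, reference in [("42", "42"), ("41", "42")]:
    print(
        answer,
        correct_reward(answer, reference),
        incorrect_reward(answer, reference),
        random_reward(answer, reference, rng),
    )
```

Only the first function carries information about correctness. The other two cannot teach the task, so any gain they produce has to come from behavior the model already had.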

That reframes what reinforcement learning is even doing here. A broader look at reward-learning dynamics (see "What does reward learning actually do to model reasoning?") argues that RLVR doesn't expand a model's capability boundary; it sharpens sampling efficiency within it. A single training example can suffice to 'activate' a strategy, and spurious rewards work nearly as well as correct ones precisely because the reasoning was already there to be found. The strongest version of this claim is that base models already contain the reasoning (see "Do base models already contain hidden reasoning ability?"): five unrelated techniques (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, and RLVR) all elicit reasoning sitting dormant in base-model activations. The bottleneck is elicitation, not acquisition. If reasoning is something you select rather than create, then it stops being mysterious that a noisy reward can flip the switch.
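
One way to picture "sharpening sampling efficiency within the boundary" is the standard pass@k arithmetic, sketched below. The per-sample success rates are hypothetical numbers chosen only for illustration, and the formula assumes independent samples.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    given a per-sample success rate p."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical success rates on a single problem (not measured values):
base_p = 0.05          # base model rarely samples the right path
tuned_p = 0.40         # after RL, the same path is sampled far more often
out_of_reach_p = 0.0   # a problem the base model never solves

for k in (1, 16, 256):
    print(
        k,
        round(pass_at_k(base_p, k), 3),
        round(pass_at_k(tuned_p, k), 3),
        round(pass_at_k(out_of_reach_p, k), 3),
    )
```

Under these assumptions the tuned model wins clearly at k = 1, the two converge as k grows, and a problem with zero base success stays at zero however many samples are drawn. That is the sense in which the boundary does not move.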

There's even a mechanistic clue about why so little signal is needed. Only about 20% of tokens, the high-entropy 'forking points' where the model genuinely decides where to go, carry the learning signal (see "Do high-entropy tokens drive reasoning model improvements?"). Training on just those matches or exceeds full-gradient performance. If the levers that matter are this sparse and this concentrated, a crude or random reward only has to push on the right few decision points, and at those points a well-pretrained model already leans toward the useful choice.
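
The forking-token idea can be sketched as an entropy mask over token positions. This is a simplified illustration: the helper names are hypothetical, and selecting the top 20% within one short sequence is an assumption, not the exact selection rule from the cited work.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(distributions, keep_fraction=0.2):
    """Return True for the highest-entropy positions, False elsewhere."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]


def masked_mean_loss(per_token_losses, mask):
    """Average the loss over kept positions only; masked tokens contribute nothing."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)


# Toy next-token distributions for five positions; only position 1 is a real fork.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.10, 0.00, 0.00],
    [0.99, 0.01, 0.00, 0.00],
    [0.80, 0.10, 0.05, 0.05],
]
mask = forking_token_mask(distributions)
print(mask)
print(masked_mean_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask))
```

In a real training loop the mask would multiply the per-token policy-gradient loss, so that only the uncertain positions receive updates.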

The natural follow-on is to plant the reasoning earlier rather than fight to elicit it later. One line of work treats chain-of-thought as an exploratory action *during pretraining* (see "Can chain-of-thought reasoning be learned during pretraining itself?"), using log-likelihood improvement as a verifier-free reward and lifting math/science benchmarks by ~19%, which suggests the 'suitable pretraining' that the spurious-reward result depends on can be engineered deliberately. And if you do want real signal instead of noise, cheaper-than-human options exist: a model's own answer-span confidence can serve as the reward (see "Can model confidence work as a reward signal for reasoning?"), strengthening reasoning without labels or verifiers.
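
Both verifier-free signals reduce to simple arithmetic on model probabilities, as the rough sketch below shows. The function names, the mean-probability definition of confidence, and the best-versus-rest pairing are all assumptions for illustration; the cited methods may define these quantities differently.

```python
def log_likelihood_gain(logp_with_thought: float, logp_without_thought: float) -> float:
    """Reward for a sampled thought: how much it raises the log-likelihood
    of the observed continuation compared with predicting without it."""
    return logp_with_thought - logp_without_thought


def answer_span_confidence(answer_token_probs):
    """Illustrative confidence score: mean probability of the answer-span tokens."""
    return sum(answer_token_probs) / len(answer_token_probs)


def preference_pairs(traces):
    """Rank traces by confidence and pair the top one against each of the rest
    as (chosen_id, rejected_id)."""
    ranked = sorted(
        traces,
        key=lambda t: answer_span_confidence(t["answer_probs"]),
        reverse=True,
    )
    best = ranked[0]
    return [(best["id"], other["id"]) for other in ranked[1:]]


# A thought that lifts log p from -2.0 to -0.5 earns a positive reward.
print(log_likelihood_gain(-0.5, -2.0))

# Hypothetical traces with made-up answer-token probabilities.
traces = [
    {"id": "a", "answer_probs": [0.6, 0.7]},
    {"id": "b", "answer_probs": [0.9, 0.95]},
    {"id": "c", "answer_probs": [0.3, 0.4]},
]
print(preference_pairs(traces))
```

Neither signal needs a labeled answer: the first compares the model against itself with and without a thought, and the second compares the model's own traces against each other.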

The thing you didn't know you wanted to know: 'random rewards work' isn't a paradox about reinforcement learning being magic — it's evidence that the reasoning was finished before the reward arrived. The reward is a key, not a teacher, and whether it fits the lock was decided during pretraining.


Sources (6 notes)

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves 16–25% on MATH-500 from random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems analyst. The question remains open: can random or misaligned rewards improve reasoning models, and if so, under what conditions and via what mechanism?

What a curated library found — and when (dated claims, not current truth):
Findings span April–October 2025. A library of recent work on RLVR and reasoning elicitation reports:
• Qwen2.5-Math gained 16–25% on MATH-500 from deliberately random/wrong rewards; Llama and OLMo showed no gain — suggesting pretraining format, not reward correctness, determines what RL can surface (~2025).
• RLVR does not expand capability boundaries but sharpens sampling efficiency within them; base models already contain latent reasoning waiting for elicitation, not acquisition (~2025).
• Only ~20% of tokens—high-entropy 'forking points'—carry meaningful learning signal in RLVR; training on just those matches full-gradient performance (~2025).
• Chain-of-thought treated as exploratory action during pretraining with log-likelihood reward lifts math/science benchmarks ~19% without verifiers (~2025).
• Model confidence (answer-span certainty) serves as intrinsic reward, strengthening reasoning without external labels (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.01939 (July 2025): High-Entropy Minority Tokens Drive Effective Reinforcement Learning
• arXiv:2510.01265 (Sept 2025): RLP: Reinforcement as a Pretraining Objective
• arXiv:2507.21931 (July 2025): Post-Training via RLVR from Self-Feedback
• arXiv:2510.18176 (Oct 2025): RLVR Traces in Math Domains (Local vs. Global Validity)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, determine whether post-October 2025 work on model scaling, longer-horizon training, multi-agent orchestration, or better mechanistic interpretability has relaxed or overturned it. Specifically: does the 20%-token bottleneck hold for models >100B params? Can reasoning truly *not* be acquired, only elicited? Where does the pretraining-format dependency still bind downstream reasoning, and where has it been circumvented?
(2) Surface the strongest recent work (last ~6 months) that contradicts the 'reasoning is latent, not learned' thesis or shows reward quality *does* matter even with weak pretraining.
(3) Propose 2 research questions that assume the regime may have shifted: one on whether synthetic intermediate supervision during pretraining can *create* rather than merely unlock reasoning, another on whether adversarially-misaligned rewards can now *degrade* performance on newer models in ways the 2025 library did not observe.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
