INQUIRING LINE

How do reward signals in RLVR interact with pretraining biases?

This explores whether the reward in RLVR (reinforcement learning with verifiable rewards) is what actually teaches a model to reason, or whether it just surfaces patterns the model already absorbed during pretraining — and how much the reward signal itself even matters.


The corpus points hard toward the second reading: the reward mostly surfaces what pretraining already laid down rather than teaching reasoning. The most striking evidence is that, for some models, RLVR works almost as well with random or even wrong rewards as with correct ones. Qwen2.5-Math gains 16–25% on MATH-500 from spurious rewards, while Llama and OLMo get nothing from the same treatment (see: Why do random rewards improve reasoning for some models but not others?). The reward isn't injecting a skill; it's flipping a switch on latent code-reasoning behavior that Qwen's pretraining happened to install and the others lack. Several notes frame this the same way. Verifiable rewards act as catalysts that surface existing capabilities, not teachers that build new ones (see: What does reward learning actually do to model reasoning?; How does RL training reshape reasoning and what gets lost?), and effectiveness tracks pretraining quality rather than reward correctness or training volume (see: Why does RLVR work with completely random rewards?).

If the reward is mostly a catalyst, the natural question is what it catalyzes. The answer is that it amplifies one pretraining bias at the expense of others. Controlled experiments show RL converges on a single dominant output format from the pretraining distribution within the first epoch, collapsing the alternatives. Tellingly, the format that wins depends on model scale rather than necessarily on which format performs best, and this dynamic is largely invisible when you start from a proprietary base model whose priors you can't see (see: Does RL training collapse format diversity in pretrained models?). So the reward signal is less an external teacher than a selection pressure operating on a fixed menu the model brought with it.

This also explains why RLVR doesn't expand what a model can do. Pass@k analysis shows base models actually beat their RLVR-tuned versions at high k: RLVR narrows sampling toward solutions already living in the base distribution rather than adding new ones, while distillation (importing another model's reasoning) genuinely transfers new patterns (see: Does RLVR actually expand what models can reason about?). The mechanism shows up even at the parameter level. RL touches only 5–30% of weights, in sparse but nearly full-rank subnetworks that are almost identical across random seeds, which points to structural, prior-bounded updates rather than wholesale relearning (see: Does reinforcement learning update only a small fraction of parameters?).
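
The pass@k comparison can be made concrete with a short sketch. It uses the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The per-problem counts are made up purely for illustration (they do not come from any of the notes), and the names `pass_at_k`, `mean_pass_at_k`, `base_counts`, and `tuned_counts` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the chance that at least one of k samples,
    drawn without replacement from n samples with c correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)


N = 100  # samples drawn per problem (assumed)

# ILLUSTRATIVE, made-up counts of correct samples (out of N) on two problems.
base_counts = [2, 30]    # base model: rare successes on the harder problem
tuned_counts = [0, 90]   # tuned model: sharper on one problem, zero on the other

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k}: base={base:.3f} tuned={tuned:.3f}")
```

With these toy counts the tuned model is ahead at k = 1 and the base model is ahead at k = 64, which is the shape of the crossover the note describes. Real curves depend on the actual sampling data.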

The interaction has a darker side worth knowing: because the reward only reshapes existing tendencies, a badly designed signal can corrupt pretrained capability instead of refining it. Overly hard problems push models toward degenerate shortcuts such as answer repetition and computation-skipping, and group-relative normalization treats rare lucky successes as high-advantage, reinforcing the shortcuts until they contaminate skills the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). Binary correctness rewards similarly degrade calibration by rewarding confident guessing, which is fixable by adding a Brier-score term (see: Does binary reward training hurt model calibration?). And the polarity of the signal matters more than people assume: negative-only reinforcement (suppressing wrong trajectories) often matches full PPO/GRPO while preserving the diversity that positive-only reinforcement destroys by over-concentrating probability mass (see: Does negative reinforcement alone outperform full reinforcement learning?).
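
Two of these mechanisms are easy to see in a few lines of arithmetic. The sketch below is illustrative only: `group_relative_advantages` standardizes rewards within one group of rollouts (population standard deviation and the `eps` constant are assumptions; implementations differ), and `calibrated_reward` is one plausible way to write a correctness reward with a Brier-score term, not necessarily the exact form used in the work the note summarizes. Both function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.
    Assumption: population std plus a small eps to avoid division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Binary correctness minus a Brier penalty on the stated confidence.
    Assumed form: y - (q - y)^2, with y in {0, 1} and q in [0, 1]."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# A near-impossible problem: 1 lucky success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# A moderate problem: 8 successes in 16 rollouts.
moderate_group = [1.0] * 8 + [0.0] * 8

lucky = group_relative_advantages(hard_group)[0]
typical = group_relative_advantages(moderate_group)[0]
print(f"advantage of the lone success: {lucky:.2f}")
print(f"advantage of a success at 50%: {typical:.2f}")

# With a plain binary reward every wrong answer scores 0 regardless of
# confidence; the Brier term separates confident from hedged mistakes.
print(calibrated_reward(False, 0.95), calibrated_reward(False, 0.20))
```

For binary rewards with success rate p, the standardized advantage of a success is roughly sqrt((1 - p) / p), so the rarer the success, the larger the push toward whatever that rollout happened to do, shortcut or not.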

The quietly useful takeaway: if you want RLVR to add capability rather than just sharpen what pretraining gave you, the reward can't do it alone. Sequencing imitation first (Supervised RL to build reasonable rollouts) and then RLVR to sharpen them beats either alone, because imitation creates the trajectories that make the outcome reward informative in the first place (see: Does sequencing imitation then exploration training improve reasoning?). RL training even self-organizes into a two-phase arc, mastering execution before strategic planning becomes the bottleneck (see: Does RL training follow a predictable two-phase learning sequence?). The reward signal, in other words, is a lever, but it only moves what pretraining already put within reach.


Sources (12 notes)

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why does RLVR work with completely random rewards?

RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, without any explicit sparsity regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM-reasoning researcher re-testing claims about reward-pretraining interaction in RLVR. The question: do reward signals in RLVR teach new reasoning, or merely surface latent pretraining biases? A curated library (2024–2026) found—and these are dated claims, not current truth:

• Spurious/random rewards improve RLVR nearly as much as correct ones (Qwen2.5-Math: +16–25% on MATH-500 with wrong rewards; Llama/OLMo gain nothing) (~2025).
• RL converges within one epoch on a single dominant output format from pretraining, collapsing alternatives; winning format correlates with model scale, not performance quality (~2025).
• Base models at high pass@k outperform their RLVR-tuned versions; RLVR narrows rather than expands the solution distribution (~2025).
• RL updates only 5–30% of parameters in sparse, seed-invariant, full-rank subnetworks (~2025).
• Overly hard samples induce degenerate shortcuts (answer repetition, computation-skipping); binary rewards degrade calibration; negative-only RL preserves diversity better than positive-only (~2025).
• Imitation-first (SRL) then RLVR outperforms either alone; RL self-organizes into procedural consolidation before strategic planning (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining, 2025-04)
• arXiv:2504.13837 (Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base, 2025-04)
• arXiv:2506.01347 (The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning, 2025-06)
• arXiv:2605.28388 (Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs, 2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, has newer model capability, post-training methods (mixture-of-experts, state-space RL, soft rewards), evaluation harnesses (live benchmarks, out-of-distribution reasoning), or mechanistic understanding since RELAXED or OVERTURNED it? Separate the durable question (reward as selection vs. creation) from perishable limitations (e.g., specific RL algorithms, scale regimes). Cite what relaxed it; plainly state where the constraint still holds.
(2) SURFACE THE STRONGEST DISAGREEMENT in the last 6 months. Does any recent work argue rewards DO teach new reasoning, or that pretraining bias is NOT the bottleneck? Name the paper and its counterargument.
(3) Propose 2 research questions that ASSUME the reward-as-selection regime may have shifted—e.g., via better curriculum design, compositional rewards, or models pretrained on reasoning traces.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
