Spurious Rewards: Rethinking Training Signals in RLVR

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 16.4% (format reward), 24.6% (incorrect label), 24.4% (1-shot RL), and 26.5% (majority voting)—nearly matching the 28.8% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning—thinking in code without actual code execution—to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 66.7% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work.

Introduction. Reinforcement learning with verifiable rewards (RLVR) is highly effective in enhancing language model reasoning (Lambert et al., 2024; DeepSeek-Math, 2024; Zeng et al., 2025; Luo et al., 2025b). We show, counterintuitively, that RLVR can improve mathematical reasoning even with weak or flawed spurious rewards when applied to Qwen2.5-Math models (Yang et al., 2024a,b), a popular and performant model family used in the RLVR literature (Hu et al., 2025a; Yang et al., 2025; Wang et al., 2024; Guan et al., 2025; Zeng et al., 2025) (§2). For example, training with incorrect labels yields a 24.6% absolute accuracy gain on MATH-500, while format and random rewards yield gains of 16.4% and 21.4%, respectively. Strikingly, these spurious-reward gains are comparable to the 28.8% gain from training on ground truth. We observe similar trends on more challenging math benchmarks such as AMC and AIME.
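To make the reward variants named above concrete, here is a minimal Python sketch of how ground-truth, incorrect-label, format, random, and majority-vote rewards could be scored. It is an illustration under stated assumptions, not the paper's implementation: the `\boxed{...}` answer convention, exact string matching, the 0.5 reward probability, and all function names are hypothetical choices made for this example.

```python
import random
import re
from collections import Counter
from typing import List, Optional

# Assumption: final answers appear as \boxed{...}; the text above does not specify this.
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")


def extract_answer(response: str) -> Optional[str]:
    """Return the last non-empty boxed answer in a response, or None."""
    answers = [m.strip() for m in BOXED.findall(response)]
    answers = [a for a in answers if a]
    return answers[-1] if answers else None


def ground_truth_reward(response: str, label: str) -> float:
    """1.0 if the extracted answer exactly matches the correct label."""
    return 1.0 if extract_answer(response) == label else 0.0


def incorrect_label_reward(response: str, wrong_label: str) -> float:
    """Same check as above, but against a label known to be wrong."""
    return 1.0 if extract_answer(response) == wrong_label else 0.0


def format_reward(response: str) -> float:
    """1.0 whenever a boxed answer is present, regardless of correctness."""
    return 1.0 if extract_answer(response) is not None else 0.0


def random_reward(response: str, rng: random.Random, p: float = 0.5) -> float:
    """Ignores the response; rewards with probability p (p=0.5 is an assumption)."""
    return 1.0 if rng.random() < p else 0.0


def majority_vote_label(responses: List[str]) -> Optional[str]:
    """Pseudo-label: the most common extracted answer across sampled responses."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```

Only `ground_truth_reward` depends on the correct answer. The format and random rewards never look at correctness, and the incorrect-label reward pays out for matching a wrong answer, which is what makes the reported gains surprising.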

Discussion / Conclusion. Our research demonstrates that RLVR with weak or spurious rewards (format-only, random, and incorrect) improves reasoning in Qwen2.5-Math models largely by amplifying existing reasoning patterns. As one example, we find that RLVR encourages more frequent code reasoning, a capability already present in the pretrained model (e.g., Qwen2.5-Math-7B) that correlates with higher accuracy. Our observational experiments confirm that both code usage frequency and test accuracy increase during RLVR training across all reward settings, but only for Qwen2.5-Math models. As further validation, we show that directly inducing code reasoning results in strong performance gains. Our findings have three main implications: base model pretraining significantly affects RLVR outcomes; even corrupted or spurious supervision can enhance reasoning when it triggers useful existing behaviors; and effects observed in one model family may not generalize to others. Our work highlights the importance of testing across multiple models with differing pretraining distributions when evaluating reinforcement learning techniques.
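The observational analysis described above (how often responses use code reasoning, and whether those responses are more accurate) could be approximated along the following lines. This is a rough sketch, not the authors' measurement procedure: the substring markers in `CODE_MARKERS` are a hypothetical heuristic, and the function names and returned keys are invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical heuristic: treat a response as "code reasoning" if it contains
# any of these substrings. The detection rule is an assumption of this sketch.
CODE_MARKERS = ("python", "def ", "import ")


def uses_code_reasoning(response: str) -> bool:
    """True if the response contains any assumed code marker."""
    return any(marker in response for marker in CODE_MARKERS)


def _accuracy(outcomes: List[bool]) -> Optional[float]:
    """Fraction of True outcomes, or None for an empty group."""
    return sum(outcomes) / len(outcomes) if outcomes else None


def code_reasoning_stats(
    responses: List[str], is_correct: List[bool]
) -> Dict[str, Optional[float]]:
    """Code-reasoning frequency plus accuracy with and without code reasoning."""
    if len(responses) != len(is_correct):
        raise ValueError("responses and is_correct must have the same length")
    flags = [uses_code_reasoning(r) for r in responses]
    with_code = [c for f, c in zip(flags, is_correct) if f]
    without_code = [c for f, c in zip(flags, is_correct) if not f]
    return {
        "code_frequency": _accuracy(flags),
        "accuracy_with_code": _accuracy(with_code),
        "accuracy_without_code": _accuracy(without_code),
    }
```

Computing these statistics on the same evaluation set before and after RLVR would show whether code frequency and accuracy move together. A substring heuristic like this will misclassify some responses, so spot-checking a sample by hand is advisable.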

Lines of inquiry this paper opens

Research framings built by reading the notes related to this paper — the questions it feeds into.

What properties determine whether reward signals teach genuine reasoning?

What constrains reinforcement learning's ability to expand model reasoning?

How does memorization interact with learning and generalization?

How much RLVR improvement comes from benchmark data memorization?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can clean benchmarks reveal true RLVR reasoning gains?

Can language model RL training avoid reward hacking and misalignment?
