Why do different models respond differently to spurious rewards?
This explores why a training trick that seems like it shouldn't work (rewarding a model with signals unrelated to correctness) boosts reasoning in some models but does nothing for others. The corpus's sharpest answer is that the reward isn't teaching the model anything new; it's switching on behavior the model already learned during pretraining. The clearest demonstration is that Qwen2.5-Math gains 16–25% on MATH-500 from random or even *incorrect* rewards, while Llama and OLMo show no gains at all (see "Why do random rewards improve reasoning for some models but not others?"). The difference isn't in the reward, which is identical noise for all of them; it's in what each model was pretrained on. Qwen had latent code-reasoning patterns waiting to be surfaced; the others had no comparable behavior to surface.
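To make "identical noise" concrete, here is a minimal sketch of what a correct, a random, and an incorrect reward might look like. The function names, the exact-match grading, and the 0.5 payout probability are illustrative assumptions, not the setup used in the cited work.

```python
import random

def correct_reward(answer: str, gold: str, rng: random.Random) -> float:
    """Ground-truth reward: 1.0 only when the answer matches the reference."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def random_reward(answer: str, gold: str, rng: random.Random, p: float = 0.5) -> float:
    """Spurious reward: a coin flip that never looks at the answer."""
    return 1.0 if rng.random() < p else 0.0

def incorrect_reward(answer: str, gold: str, rng: random.Random) -> float:
    """Spurious reward: pays out only for answers that are wrong."""
    return 1.0 - correct_reward(answer, gold, rng)

def mean_reward_gap(reward_fn, n: int = 5000, seed: int = 0) -> float:
    """Mean reward on correct answers minus mean reward on wrong answers."""
    rng = random.Random(seed)
    right = sum(reward_fn("42", "42", rng) for _ in range(n)) / n
    wrong = sum(reward_fn("41", "42", rng) for _ in range(n)) / n
    return right - wrong

if __name__ == "__main__":
    for fn in (correct_reward, random_reward, incorrect_reward):
        print(fn.__name__, round(mean_reward_gap(fn), 3))
```

In this sketch the random reward pays correct and wrong answers at about the same rate, so its gap sits near zero, and the incorrect reward's gap is negative. Neither carries usable information about correctness, which is why any gain under such rewards has to be attributed to the model rather than to the signal.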
This reframes what reinforcement learning is doing in the first place. Rather than expanding a model's reasoning ability, reward learning mostly *activates* strategies already present and improves how efficiently the model samples them, staying inside the capability boundary set by pretraining rather than pushing past it (see "What does reward learning actually do to model reasoning?"). That's why a single training example, or a meaningless reward, can be nearly as effective as a carefully correct one for a model with the right pretraining: the heavy lifting was done before RL ever started. The reward is a wake-up call, not a curriculum. So 'why do models respond differently' becomes 'what did each model already know how to do, latently, before you applied pressure?'
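The activation-versus-expansion distinction can be illustrated with a toy pass@k calculation. Everything here is a hypothetical sketch: the per-problem success rates are made up, and `sharpen` is a stand-in for an RL update that boosts the odds of success only where the base model already succeeds sometimes.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

def sharpen(p: float, strength: float = 4.0) -> float:
    """Toy 'RL' update: multiply the odds of success, leaving p = 0 (and p = 1) unchanged."""
    if p in (0.0, 1.0):
        return p
    odds = strength * p / (1.0 - p)
    return odds / (1.0 + odds)

def mean_pass(ps, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in ps) / len(ps)

# Hypothetical per-sample success rates of a base model on five problems.
# Two problems are outside its capability boundary (p = 0).
BASE = [0.0, 0.0, 0.05, 0.2, 0.6]
TUNED = [sharpen(p) for p in BASE]

if __name__ == "__main__":
    for k in (1, 8, 256):
        print(k, round(mean_pass(BASE, k), 3), round(mean_pass(TUNED, k), 3))
```

In this toy setting pass@1 rises after sharpening, but pass@256 converges to the same ceiling for both models: the fraction of problems the base model could already solve at least occasionally. The two problems with zero base probability stay unsolved regardless of how strong the update is.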
There's a subtler layer here worth knowing. We tend to assume spurious signals are noise a good model should *ignore*; this is the shortcut-learning view. But in some reasoning tasks the opposite holds: stripping out spurious cues actually *hurts* performance, because the real challenge is integrating conflicting signals into a coherent answer rather than filtering distractors out (see "Why does removing spurious cues sometimes hurt model performance?"). That suggests a model's relationship to 'spurious' information is bound up with how it composes cues, not just whether it can screen them, which is another axis along which differently trained models can diverge.
The broader lesson the corpus keeps circling is that the reward signal carries far less causal information than its effect on training implies. Standard reward models can't even distinguish causal quality features from spurious ones, picking up length, sycophancy, and other phantom signals unless explicitly constrained (see "Can counterfactual invariance eliminate reward hacking biases?"). Reward scores also barely move when you swap the prompt but keep the response, showing that they often grade against signals only loosely tied to the actual task (see "Why do reward models ignore what question was asked?"). If the reward channel is this lossy and this easy to fool, then the outcome of training is determined largely by what the model brings to it, which is exactly why two models fed the same spurious rewards walk away looking nothing alike.
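The prompt-swap observation suggests a simple diagnostic, sketched below under stated assumptions. `prompt_swap_shift` and both scorers are hypothetical helpers written for illustration; a real reward model would take the place of the toy scoring functions.

```python
from typing import Callable, Sequence, Tuple

Scorer = Callable[[str, str], float]

def prompt_swap_shift(score: Scorer, pairs: Sequence[Tuple[str, str]]) -> float:
    """Mean absolute score change when each response is re-scored
    against the prompt of the next pair instead of its own prompt."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two (prompt, response) pairs to swap")
    total = 0.0
    for i, (prompt, response) in enumerate(pairs):
        other_prompt = pairs[(i + 1) % n][0]
        total += abs(score(prompt, response) - score(other_prompt, response))
    return total / n

def length_scorer(prompt: str, response: str) -> float:
    """Toy scorer with a pure length bias: it never reads the prompt."""
    return float(len(response.split()))

def overlap_scorer(prompt: str, response: str) -> float:
    """Toy prompt-aware scorer: fraction of prompt words echoed in the response."""
    p = set(prompt.lower().split())
    r = set(response.lower().split())
    return len(p & r) / max(len(p), 1)

PAIRS = [
    ("what is the capital of france", "the capital of france is paris"),
    ("how many legs does a spider have", "a spider has eight legs"),
]

if __name__ == "__main__":
    print("length scorer shift:", prompt_swap_shift(length_scorer, PAIRS))
    print("overlap scorer shift:", prompt_swap_shift(overlap_scorer, PAIRS))
```

A scorer that ignores the prompt shows a shift of exactly zero, while a prompt-aware scorer shows a clearly positive shift. A shift close to zero on a real reward model would be consistent with the behavior described above, though this sketch alone does not establish which spurious features drive the score.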
The thing you didn't know you wanted to know: the same noise that does nothing to one model can unlock double-digit gains in another, and the deciding factor lives entirely in pretraining, meaning a 'reward' in RL is often closer to a key than a teacher. If you want to keep pulling this thread, the negative-reinforcement work showing that training *only* on what's wrong can match full RL (see "Does negative reinforcement alone outperform full reinforcement learning?") is a good next door to open: it pushes further on the idea that the informative part of a reward may not be the part you'd expect.
Sources (6 notes)
Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.
Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination bias. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.
When prompts are swapped while keeping responses identical, reward model scores barely change. This reveals that standard RLHF optimizes against phantom quality signals rather than prompt-response alignment, enabling four distinct biases.
Training with only negative samples consistently improves Pass@k across the full range of k, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.