Does negative reinforcement alone achieve what full RL training accomplishes?
This explores whether the 'punish wrong answers' half of RL training is doing most of the work, and whether the 'reward right answers' half adds anything that suppression alone doesn't already deliver. The corpus gives a surprisingly clean answer: negative reinforcement, used by itself, consistently improves Pass@k across the spectrum and often matches full PPO and GRPO (see "Does negative reinforcement alone outperform full reinforcement learning?"). The mechanism is the interesting part. Training only on negative samples pushes down the probability of wrong trajectories without piling probability mass onto a few favored right ones, so the model stays diverse and its Pass@k holds up at both low and high k. Positive-only reinforcement does the opposite: it concentrates the distribution and degrades higher-k performance. In other words, the reward half can actively hurt the breadth that makes a model good at sampling its way to a correct answer.
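The Pass@k argument can be made concrete with the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sketch below is illustrative only: the two "models" and their per-problem correct counts are made-up numbers, not results from the corpus. It shows how a distribution concentrated on one solved problem can win at k=1 and still lose at larger k.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples, so every draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical correct counts out of N=10 samples on two problems (assumed data).
N = 10
concentrated = [10, 0]  # always solves problem A, never solves problem B
diverse = [4, 4]        # solves each problem some of the time

for k in (1, 5):
    print(k, mean_pass_at_k(concentrated, N, k), round(mean_pass_at_k(diverse, N, k), 3))
```

In this toy setup the concentrated model leads at k=1 (0.5 against 0.4) but stays at 0.5 for k=5, while the diverse model climbs to roughly 0.976.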
This lines up with a separate look at what physically changes inside a model during RL. Training touches only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?"). And when researchers asked which direction those updates point, the dominant force appeared to be suppression of wrong trajectories rather than amplification of right ones (see "What actually changes inside a model during RL training?"). So the 'negative reinforcement alone' result isn't a clever trick layered on top of RL; it looks like an honest description of what RL was mostly doing all along.
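One rough way to check the "5–30% of parameters" figure on your own checkpoints is to count how many weights actually moved. The sketch below is a minimal illustration under stated assumptions: the function name fraction_updated, the tolerance argument, and the toy weight lists are all hypothetical, and a real comparison would iterate over the tensors of two saved checkpoints rather than flat Python lists.

```python
def fraction_updated(base, tuned, tol=0.0):
    """Fraction of parameters whose value changed by more than tol."""
    if len(base) != len(tuned):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, t in zip(base, tuned) if abs(t - b) > tol)
    return changed / len(base)

# Hypothetical flattened weights: 2 of the 10 entries differ after training.
base_weights = [0.10, -0.20, 0.30, 0.00, 0.50, -0.10, 0.25, 0.40, -0.35, 0.05]
tuned_weights = [0.10, -0.20, 0.31, 0.00, 0.50, -0.10, 0.25, 0.38, -0.35, 0.05]

print(fraction_updated(base_weights, tuned_weights))  # 0.2
```

This only measures how many parameters moved. It says nothing about the rank of the update or the direction it points, which the notes treat as separate findings.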
There's a deeper reason this works, which surfaces when you ask where the correct behaviors come from in the first place. A cluster of findings argues that RL with verifiable rewards doesn't teach new reasoning: it surfaces strategies already latent in pretraining, bounded by the prior (see "How does RL training reshape reasoning and what gets lost?"). The starkest version: completely random or even incorrect rewards still boost Qwen2.5-Math by 16–25% on MATH-500, while Llama and OLMo show no gains, because the improvement comes from activating pretrained code-reasoning habits, not from the reward signal carrying real information (see "Why do random rewards improve reasoning for some models but not others?"). If the right answers are already in the model and just need to be reached, then your job is mostly to clear away the wrong paths, which is exactly what negative reinforcement does and exactly why the 'positive' signal can be redundant or harmful.
But the clean story has a real boundary, and it's worth knowing before you conclude 'suppression is all you need.' When models are trained long enough on diverse, non-mathematical tasks with KL control and policy resetting, RL outperforms the base model at every Pass@k level and discovers reasoning strategies the base model could not produce (see "Can reinforcement learning discover reasoning strategies base models cannot?"). That's the regime where the base model lacks established patterns, so there is little latent ability to surface and pure suppression would have little to work with. The reconciliation: on tasks the base model could already half-do, negative reinforcement captures most of full RL's value by pruning errors and preserving diversity, but on genuinely novel territory, the positive, exploratory pressure of full RL earns its keep.
The thread worth pulling, if you want to go further, is what gets quietly lost. RL tends to converge on a single dominant pretraining format within the first epoch and collapse the alternatives, and the winning format depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). Negative reinforcement's edge is precisely that it resists this collapse. So the real question behind 'does negative reinforcement match full RL' may be: how much of full RL's apparent learning is actually diversity destruction you'd be better off without?
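If you want to watch for this kind of collapse yourself, one simple proxy is the entropy of the output formats across a batch of samples. The sketch below is an assumption-laden illustration, not a method from the notes: the format labels ("code", "latex", "prose") and both sample batches are invented, and how you assign a format label to a real generation is left open.

```python
from collections import Counter
from math import log2

def format_entropy(labels):
    """Shannon entropy, in bits, of the format labels in a batch of samples."""
    total = len(labels)
    counts = Counter(labels)
    # Sum p * log2(1/p) over observed formats; 0.0 means a single format dominates fully.
    return sum((c / total) * log2(total / c) for c in counts.values())

# Hypothetical format labels for 12 sampled outputs before and after RL.
before_rl = ["code"] * 4 + ["latex"] * 4 + ["prose"] * 4
after_rl = ["code"] * 12

print(round(format_entropy(before_rl), 3), format_entropy(after_rl))
```

A drop toward zero bits within the first epoch would be consistent with the collapse described above, though entropy alone cannot tell you whether the surviving format is the best-performing one.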
Sources (7 notes)
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Qwen2.5-Math improves by 16–25% on MATH-500 under random or incorrect rewards, which activate latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.