Can negative reinforcement alone match full RL performance on domain tasks?
This explores whether training a model only on its wrong answers, pushing it away from incorrect trajectories, can match full reinforcement learning, which both rewards correct answers and penalizes incorrect ones, on real domain tasks.
The short answer from the corpus is a surprising yes: negative-only training consistently improves Pass@k across the spectrum of k and often matches full PPO and GRPO (see "Does negative reinforcement alone outperform full reinforcement learning?"). The reason is more interesting than the result. Positive-only reinforcement keeps concentrating probability mass on the answers the model already finds, which quietly kills diversity and degrades Pass@k at higher k, where many samples are drawn per problem. Negative reinforcement does the opposite: it prunes the wrong trajectories while leaving the model's range of correct approaches intact. So "only punish mistakes" isn't a weaker version of RL; it's a different pressure that preserves exploration.
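The difference between the two pressures can be seen in a single policy-gradient step. The sketch below is a toy illustration only, not the training setup of the cited work: it assumes a softmax policy over four candidate answers (two correct, two wrong), a plain REINFORCE update on `reward * log p(sampled)`, and hypothetical names such as `reinforce_step` and `start_probs`. It shows only the direction of one sampled update, not the outcome of a full training run.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, sampled, reward, lr=0.1):
    """One gradient-ascent step on reward * log p(sampled) for a softmax policy.

    The gradient with respect to logit j is reward * (1[j == sampled] - p_j).
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if j == sampled else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical toy policy: answers 0 and 1 are correct, answers 2 and 3 are wrong.
start_probs = [0.4, 0.3, 0.2, 0.1]
logits = [math.log(p) for p in start_probs]

# Positive-only: the dominant correct answer (index 0) was sampled and rewarded.
after_positive = softmax(reinforce_step(logits, sampled=0, reward=+1.0))

# Negative-only: a wrong answer (index 2) was sampled and penalized.
after_negative = softmax(reinforce_step(logits, sampled=2, reward=-1.0))

print("start         ", [round(p, 4) for p in start_probs])
print("after positive", [round(p, 4) for p in after_positive])
print("after negative", [round(p, 4) for p in after_negative])
```

In this toy, rewarding answer 0 lowers the probability of the other correct answer (index 1), while penalizing wrong answer 2 raises the probability of both correct answers. That is the per-sample intuition behind "prunes the wrong trajectories while leaving the range of correct approaches intact"; how it plays out over many sampled updates in a real model is an empirical question the toy does not settle.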
That lands inside a broader pattern the corpus keeps surfacing: successes and failures should be treated asymmetrically. SkillRL makes the same move at the level of memory. It stores successful episodes as concrete demonstrations but failures as abstracted lessons, and that asymmetry reaches state-of-the-art performance on complex tasks while using substantially less context than processing everything uniformly (see "Should successful and failed episodes be processed differently?"). Two different research lines independently find that wrong answers and right answers carry different kinds of information and shouldn't be fed back the same way.
The potency of the negative signal connects to a known failure of naive positive rewards. Binary correctness rewards ("+1 if right") actively reward confident guessing, because nothing penalizes a confident wrong answer, and that wrecks calibration (see "Does binary reward training hurt model calibration?"). A method built around suppressing the incorrect is, in effect, attacking exactly the trajectories that positive-only schemes leave unpunished. There's a thematic rhyme with entropy dynamics too: training on structured domains tends to lower output entropy, and that collapse is what damages a model's open-ended flexibility (see "Does training order reshape how models handle different task types?"). Negative reinforcement's diversity-preserving behavior is essentially an antidote to that collapse.
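The calibration point can be made concrete with a small sketch. It assumes the model reports a confidence alongside each answer and that the Brier term is combined with correctness as `correct - (confidence - correct)^2`; that exact form, and the helper names `brier_augmented_reward`, `expected_reward`, and `best_confidence`, are illustrative assumptions rather than details confirmed by the source notes.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Plain correctness reward; the stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier score of the stated confidence (assumed form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(reward_fn, p_correct, q))


p_correct = 0.3  # hypothetical true accuracy on some class of questions

# Binary reward: expected reward is the same whatever confidence is stated.
flat_low = expected_reward(binary_reward, p_correct, 0.3)
flat_high = expected_reward(binary_reward, p_correct, 0.99)

# Brier-augmented reward: expected reward peaks at the true accuracy.
best_brier = best_confidence(brier_augmented_reward, p_correct)

print(flat_low, flat_high, best_brier)
```

Under the binary reward, claiming 99% confidence costs nothing, so there is no pressure toward honest confidence. Under the assumed Brier-augmented form, expected reward is maximized when the stated confidence equals the true probability of being right, which is the sense in which a confident wrong answer finally gets penalized.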
It's worth knowing where the ceiling is, though. Other corpus work suggests pure scalar signals, positive or negative, eventually plateau because a number tells the model *that* it failed, not *why*. Natural-language critiques can break those plateaus by supplying the missing "how to improve" information (see "Can natural language feedback overcome numerical reward plateaus?"), and richer reward shaping (rewarding explanation quality, not just answers) can embed domain knowledge more deeply than training for token-level correctness (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). So the honest framing is: negative-only RL often matches full RL on the metrics tested, and does so more cheaply and with better diversity, but "matching full RL" inherits full RL's own limits, which the next wave of work is trying to escape with feedback that's qualitative rather than just signed.
Sources (6 notes)
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Backward-transfer (BWT) guided scheduling, which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.