INQUIRING LINE

Punishing an AI for wrong answers turns out to work just as well as rewarding right ones — and sometimes better.

Does negative reinforcement alone achieve what full RL training accomplishes?

This explores whether the 'punish wrong answers' half of RL training is doing most of the work, and whether the 'reward right answers' half adds anything that suppression alone doesn't already deliver. The corpus gives a surprisingly clean answer: negative reinforcement, used by itself, consistently improves Pass@k and often matches or even beats full PPO and GRPO (see "Does negative reinforcement alone outperform full reinforcement learning?"). The mechanism is the interesting part. Training only on negative samples pushes down the probability of wrong trajectories without piling probability mass onto a few favored right ones, so the model stays diverse and its Pass@k holds up across the whole range of k. Positive-only reinforcement does the opposite: it concentrates the distribution and quietly degrades higher-k performance. In other words, the reward half can actively hurt the breadth that makes a model good at sampling its way to a correct answer.
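One way to see why a negative-only signal spreads probability rather than concentrating it is to look at a single softmax step. The sketch below uses the standard identity that the gradient of log p[sampled] with respect to the logits is (one-hot minus p). The +1/-1 rewards, the four-token distribution, and the function names are illustrative assumptions, not the exact objective of any cited work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_update_direction(probs, sampled, reward):
    """Gradient of reward * log p[sampled] with respect to the logits."""
    return [reward * ((1.0 if i == sampled else 0.0) - p)
            for i, p in enumerate(probs)]

# Hypothetical next-token distribution (assumption for illustration).
probs = softmax([2.0, 1.0, 0.5, 0.0])
wrong, right = 0, 1  # assumed: token 0 came from a wrong answer, token 1 from a right one

# Negative sample: the sampled token's logit goes down, and every other
# logit goes up by an amount equal to that token's current probability.
negative = logit_update_direction(probs, wrong, reward=-1.0)

# Positive sample: the sampled token's logit goes up, all others go down.
positive = logit_update_direction(probs, right, reward=+1.0)

print("negative-sample direction:", [round(g, 3) for g in negative])
print("positive-sample direction:", [round(g, 3) for g in positive])
```

In this single-token picture, the negative step boosts the remaining candidates in proportion to what the model already believed, while the positive step pushes everything except the sampled token down. Real training sums such terms over many tokens and samples, so this is only a local intuition.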

This lines up with a separate look at what physically changes inside a model during RL. Training touches only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are nearly identical across random seeds (see "Does reinforcement learning update only a small fraction of parameters?"). And when researchers asked which direction those updates point, the dominant force was suppression of wrong trajectories rather than amplification of right ones (see "What actually changes inside a model during RL training?"). So the 'negative reinforcement alone' result isn't a clever trick layered on top of RL; it looks like an honest description of what RL was mostly doing all along.
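To make "sparse but full-rank" concrete, here is a minimal sketch of how one might measure both properties for a single weight matrix, given its values before and after training. Everything here is a toy assumption: the 64x64 size, the 20% update density, the tolerance, and the helper names are hypothetical and do not come from the cited work.

```python
import random

def update_fraction(base, tuned, tol=0.0):
    """Fraction of entries whose value changed by more than tol."""
    total = changed = 0
    for row_b, row_t in zip(base, tuned):
        for b, t in zip(row_b, row_t):
            total += 1
            if abs(t - b) > tol:
                changed += 1
    return changed / total

def matrix_rank(mat, eps=1e-12):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in mat]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= eps:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            if factor != 0.0:
                for c in range(col, n_cols):
                    rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank

# Toy stand-in for one layer: about 20% of entries receive a small update.
rng = random.Random(0)
n = 64
base = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
delta = [[rng.gauss(0.0, 0.01) if rng.random() < 0.2 else 0.0
          for _ in range(n)] for _ in range(n)]
tuned = [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(base, delta)]

frac = update_fraction(base, tuned)
rank = matrix_rank([[t - b for t, b in zip(rt, rb)]
                    for rt, rb in zip(tuned, base)])
print(f"updated fraction: {frac:.3f}, rank of update: {rank} of {n}")
```

The point of the toy is that the two properties are independent: an update can touch a small share of entries and still span nearly every direction, which is different from a low-rank update of the kind LoRA imposes.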

There's a deeper reason this works, which surfaces when you ask where the correct behaviors come from in the first place. A cluster of findings argues that RL with verifiable rewards doesn't teach new reasoning; it surfaces strategies already latent in pretraining, bounded by the prior (see "How does RL training reshape reasoning and what gets lost?"). The starkest version: completely random or even incorrect rewards still boost Qwen2.5-Math by 16–25% on MATH-500, while Llama and OLMo get nothing, because the gain comes from activating pretrained code-reasoning habits, not from the reward signal carrying real information (see "Why do random rewards improve reasoning for some models but not others?"). If the right answers are already in the model and just need to be reached, then your job is mostly to clear away the wrong paths. That is exactly what negative reinforcement does, and exactly why the 'positive' signal can be redundant or harmful.

But the clean story has a real boundary, and it's worth knowing before you conclude 'suppression is all you need.' When models are trained long enough on diverse, non-mathematical tasks with KL control and policy resetting, RL outperforms the base model at every Pass@k level and discovers reasoning strategies the base model genuinely could not produce (see "Can reinforcement learning discover reasoning strategies base models cannot?"). That's the regime where there's nothing latent to merely surface, so pure suppression would have nothing to work with. The reconciliation: on tasks the base model could already half-do, negative reinforcement captures most of full RL's value by pruning errors and preserving diversity, but on genuinely novel territory, the positive, exploratory pressure of full RL earns its keep.
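Since every comparison in this line is stated in Pass@k, it helps to have the metric concrete. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two "models" and their per-problem counts are hypothetical numbers chosen only to show how concentration can leave Pass@1 unchanged while hurting higher k.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 10  # samples drawn per problem (assumption)

# Hypothetical correct-sample counts on two problems.
concentrated = [10, 0]  # always solves problem 1, never solves problem 2
diverse = [5, 5]        # solves each problem half the time

for k in (1, 5):
    print(f"k={k}: concentrated={mean_pass_at_k(concentrated, n, k):.3f}, "
          f"diverse={mean_pass_at_k(diverse, n, k):.3f}")
```

Both hypothetical models score 0.5 at k=1. At k=5 the concentrated one stays at 0.5, while the diverse one reaches about 0.996, because a few extra samples are enough to hit a correct answer on both problems.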

The thread worth pulling, if you want to go further, is what gets quietly lost. RL tends to converge on a single dominant pretraining format within the first epoch and collapse the alternatives, and the winning format is chosen by model scale, not by which one actually performs best (see "Does RL training collapse format diversity in pretrained models?"). Negative reinforcement's edge is precisely that it resists this collapse. So the real question behind 'does negative reinforcement match full RL' may be: how much of full RL's apparent learning is actually diversity destruction you'd be better off without?
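One simple way to put a number on format collapse is the Shannon entropy of the formats a model uses across sampled outputs. This is a sketch under assumptions: the format labels and counts below are hypothetical, and the notes do not say that the underlying experiments used this particular measure.

```python
import math
from collections import Counter

def format_entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical format labels for 100 sampled outputs, before and after RL.
before = ["format_a"] * 40 + ["format_b"] * 35 + ["format_c"] * 25
after = ["format_a"] * 97 + ["format_b"] * 2 + ["format_c"] * 1

print(f"before RL: {format_entropy_bits(before):.3f} bits")
print(f"after RL:  {format_entropy_bits(after):.3f} bits")
```

Tracking this value per training step would show whether the drop happens early, as the note describes, and whether a negative-only objective keeps it higher.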

Sources (7 notes)

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Papers this line draws on (8)

The research behind the notes this line reads, ranked by how closely each paper relates.

The Invisible Leash: Why RLVR May Not Escape Its Origin (2.59 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (2.57 match)
Eliciting Reasoning in Language Models with Cognitive Tools (2.57 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (2.50 match)
The Art of Scaling Reinforcement Learning Compute for LLMs (2.46 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (2.45 match)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (2.44 match)
Spurious Rewards: Rethinking Training Signals in RLVR (1.79 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether negative reinforcement alone achieves what full RL training accomplishes in LLMs. This question is STILL OPEN—treat the library's findings as dated claims (2024–2026), not current ground truth.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026:
• Negative reinforcement alone matches or exceeds full PPO/GRPO by suppressing wrong trajectories while preserving diversity; positive-only RL concentrates the distribution and degrades Pass@k (~2506.01347, 2025-06).
• RL updates touch only 5–30% of parameters in sparse, full-rank subnetworks; the dominant force is suppression of wrong trajectories, not amplification of right ones (~2505.11711, 2025-05).
• On verifiable-reward tasks, RL surfaces latent pretraining strategies rather than teaching new reasoning; even random/incorrect rewards boost some models (16–25% for Qwen-Math) while others gain nothing (~2507.14843, 2025-07).
• On genuinely novel tasks with diverse, non-mathematical data and KL control, full RL discovers strategies the base model could not produce, outperforming at every Pass@k (~2505.24864, ProRL, 2025-05).
• RL post-training converges on a single dominant pretraining format by model scale, not performance; negative reinforcement resists this collapse (~2504.07912, 2025-04).

Anchor papers (verify; mind their dates):
• arXiv:2506.01347 (2025-06) The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
• arXiv:2505.11711 (2025-05) Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
• arXiv:2505.24864 (2025-05) ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries
• arXiv:2507.14843 (2025-07) The Invisible Leash: Why RLVR May Not Escape Its Origin

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, probe whether newer model scales, in-context learning harnesses, multi-agent orchestration (e.g., tree-search with LLM judges), or post-hoc reranking have relaxed the boundary between "tasks where negative RL suffices" vs. "genuinely novel reasoning." Does the latent-capability assumption still hold? Can larger models reach novel ground without positive RL? Separate the durable question (does negative RL underexplore?) from the perishable limitation (e.g., does diversity collapse only when reward density is high?). Cite what changed it.

(2) SURFACE THE STRONGEST DISAGREEMENT. Hunt for papers (last ~6 months) that contradict the claim that positive RL adds little on standard benchmarks, or that argue diversity loss is a feature, not a bug. Flag any work showing negative RL fails on emerging tasks (e.g., code, long-horizon planning, multi-step reasoning).

(3) PROPOSE 2 RESEARCH QUESTIONS that assume the regime may have shifted:
   – Question A: Can negative RL be made competitive on tasks requiring genuine exploration (vs. pruning error) by combining it with diversity-preserving mechanisms (e.g., entropy bonuses, mixture-of-experts routing)?
   – Question B: Does the "convergence to dominant format" problem disappear if you interleave negative RL with curriculum or active data selection?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
