Do overly hard RLVR samples actually harm model capabilities?

Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.

Synthesis note · 2026-05-28 · sourced from RLVR

The damage from over-hard RLVR samples is not merely "the model fails to improve." It is active regression. When almost every rollout on a problem fails, the rare success is unlikely to be a genuinely good solution — it is more often a shortcut, an answer reached by skipping necessary computation, or a lucky guess. Group-relative normalization then treats that one trajectory as the high-advantage exemplar of the group and reinforces it. The model learns the shortcut, not the reasoning.
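
The amplification step can be made concrete with a small numeric sketch. The note does not give the normalization formula, so this assumes a GRPO-style scheme: binary verifiable rewards, standardized within each rollout group as (reward - group mean) / (group std). The function names, the group size of 16, and the success counts are illustrative choices, not values taken from the source.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group: (r - mean) / (std + eps).

    Assumed GRPO-style normalization; the note does not state the exact formula.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def binary_group(successes, size):
    """Hypothetical helper: a group with `successes` passing rollouts out of `size`."""
    return [1.0] * successes + [0.0] * (size - successes)

# Over-hard prompt: a single rollout out of 16 passes the verifier.
sparse = group_relative_advantages(binary_group(1, 16))
# In-band prompt: half of the rollouts pass.
dense = group_relative_advantages(binary_group(8, 16))

print(f"1/16 successes: success {sparse[0]:+.2f}, failure {sparse[-1]:+.2f}")
print(f"8/16 successes: success {dense[0]:+.2f}, failure {dense[-1]:+.2f}")
```

Under these assumptions the lone success in the 1-of-16 group receives an advantage of about +3.87 while each failure receives about -0.26; in the 8-of-16 group the values are +1.00 and -1.00. The normalization only sees the reward, so the large positive weight lands on the rare passing trajectory whether it was sound reasoning or a lucky guess.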

The behavioral signature is concrete: answer repetition, skipped computation that the problem requires, and other degenerate patterns that look like reasoning collapse. More troubling, these effects do not stay local to the hard problems. They degrade the model's pre-existing capabilities, the things it could already do before training pushed it past its competence band. The internal-feature analysis corroborates this: hard problems activate reasoning-related features, but those features become useful only on the rare successful trajectory, so most of the gradient on hard samples reinforces the wrong activations.

Why it matters: the finding identifies a specific corruption channel rather than a generic "training instability." The villain is the interaction between a sparse-success reward landscape and group-relative normalization, which together turn statistical noise (an accidental success) into a learning target. This sharpens the case against naively harvesting hard examples and connects RLVR difficulty to the broader pattern where verifiable-reward training rewards trajectories that pass the check without doing the work. The counterpoint a defender might raise, that some hard problems are exactly where capability frontiers expand, holds only when successful trajectories are sampled densely enough to outvote the shortcuts, which over-hard samples by definition fail to provide.
