SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Why do medium-difficulty problems teach reasoning better than hard ones?

Does harder always mean better for learning? This note explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.

Synthesis note · 2026-05-28 · sourced from RLVR
What does reward learning actually do to model reasoning?

It is tempting to assume that harder training problems teach more, that pushing the model against the limit of its ability is where reasoning improves. RLVR does not behave that way. Difficulty-wise and one-sample analyses reveal an inverted-U: medium-difficulty problems yield the strongest and most stable reasoning improvements, easy problems add little, and overly hard problems provide weak learning signals and can actively degrade performance.

The mechanism runs through group-relative advantage. Easy problems are mostly solved, so within-group reward variance is low and the relative-advantage signal is small. Overly hard problems are mostly failed, so they too produce weak relative-advantage signals — and worse, the rare accidentally-rewarded trajectory (a shortcut, an incomplete computation that lands on the right answer) gets amplified by group-relative normalization into a biased update. Medium-difficulty problems sit where the model succeeds often enough to learn from contrast but fails often enough that success is informative — the regime where advantage estimation has the most signal.
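The mechanism above can be sketched numerically. This is a minimal illustration, not the paper's implementation: it assumes binary (0/1) verifiable rewards and a GRPO-style standardization of rewards within each sampled group, neither of which the note specifies. The function name `group_relative_advantages`, the group size of 8, and the three reward vectors are all hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    i.e. a GRPO-style normalization. eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 8 rollouts each (1 = verifier accepted the answer).
easy = [1, 1, 1, 1, 1, 1, 1, 1]    # always solved: no contrast in the group
medium = [1, 0, 1, 0, 1, 0, 1, 0]  # balanced successes and failures
hard = [0, 0, 0, 0, 0, 0, 0, 1]    # one success, possibly accidental

for name, rewards in [("easy", easy), ("medium", medium), ("hard", hard)]:
    p = mean(rewards)
    variance = p * (1 - p)  # Bernoulli reward variance, largest at p = 0.5
    adv = group_relative_advantages(rewards)
    print(f"{name:6s} pass_rate={p:.3f} variance={variance:.3f} "
          f"max_adv={max(adv):.2f} min_adv={min(adv):.2f}")
```

Under these assumptions the fully solved group produces all-zero advantages, the balanced group produces advantages near ±1, and the single success in the hard group receives an advantage of roughly √7 ≈ 2.65. If that one success was a shortcut, it is the sample the update leans on most heavily.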

Why it matters: this is a curriculum claim with teeth. It says the standard instinct to harvest hard examples for RLVR is counterproductive without intervention, and it explains why in terms of the advantage estimator rather than a vague appeal to "too hard to learn." The practical move is difficulty-adaptive: either filter toward the medium band or repair hard samples (the paper proposes backward-reasoning reformulation and feature-guided signals to raise reward density). The counterpoint is that "medium difficulty" is defined relative to the model's current capability, so the productive band moves as training proceeds. That is the seam where this static finding meets the dynamic-informativeness problem.
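The "filter toward the medium band" option can be sketched as a pass-rate filter that is re-run against the current policy. This is an illustrative sketch only: the helper names, the 0.2 to 0.8 thresholds, and the reward data are assumptions made up for the example, and the paper's own selection rule may differ.

```python
def pass_rate(rewards):
    """Fraction of rollouts the verifier accepted (rewards are 0 or 1)."""
    return sum(rewards) / len(rewards)

def select_medium_band(rollout_rewards, low=0.2, high=0.8):
    """Keep problems whose current pass rate falls inside [low, high].

    rollout_rewards maps a problem id to the 0/1 rewards of rollouts
    sampled from the *current* policy. Thresholds are hypothetical.
    """
    return [pid for pid, rewards in rollout_rewards.items()
            if low <= pass_rate(rewards) <= high]

# Made-up rollouts for the same three problems at two training stages.
early = {
    "p_easy": [1, 1, 1, 1, 1, 1, 1, 1],  # pass rate 1.0
    "p_mid":  [1, 0, 1, 0, 0, 1, 1, 0],  # pass rate 0.5
    "p_hard": [0, 0, 0, 0, 0, 0, 0, 1],  # pass rate 0.125
}
later = {
    "p_easy": [1, 1, 1, 1, 1, 1, 1, 1],  # pass rate 1.0
    "p_mid":  [1, 1, 1, 1, 1, 1, 1, 0],  # pass rate 0.875
    "p_hard": [0, 1, 0, 0, 1, 0, 1, 0],  # pass rate 0.375
}

print(select_medium_band(early))  # ['p_mid']
print(select_medium_band(later))  # ['p_hard']
```

The selected set changes between the two snapshots even though the problems are the same, which is the moving-band point: a difficulty label fixed before training would keep serving `p_mid` after it has become easy for the model.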

Original note title

sample difficulty has a non-monotonic effect on rlvr where medium-difficulty problems yield the strongest most stable gains