INQUIRING LINE

Why does medium difficulty outperform both easy and hard RLVR training samples?

This explores the 'inverted-U' pattern in RLVR (reinforcement learning from verifiable rewards) — why problems of moderate difficulty teach reasoning better than problems that are too easy or too hard.


The corpus is unusually consistent on why medium-difficulty problems hit a sweet spot in RLVR training: learning value follows an inverted-U curve. The core mechanism is the *advantage signal*. RLVR learns from the contrast between a model's successes and failures on the same problem. Easy samples produce almost all successes, so there's no variance and nothing to learn from. Hard samples produce almost all failures, so the contrast disappears there too. Medium-difficulty problems are the only band where the model succeeds often enough and fails often enough that the gradient carries information (see: Why do medium-difficulty problems teach reasoning better than hard ones?).
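To make the variance argument concrete, here is a minimal sketch of a group-relative advantage computation. It assumes binary (0/1) verifiable rewards and a mean/standard-deviation normalization within each group of rollouts for one problem; the function name, the `eps` term, and the example reward lists are illustrative assumptions, not taken from the corpus.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same problem.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 8 rollouts each (1 = verified correct, 0 = incorrect).
groups = {
    "easy": [1, 1, 1, 1, 1, 1, 1, 1],    # all successes: zero variance
    "medium": [1, 0, 1, 0, 0, 1, 1, 0],  # mixed outcomes: usable contrast
    "hard": [0, 0, 0, 0, 0, 0, 0, 0],    # all failures: zero variance
}

for name, rewards in groups.items():
    advantages = group_relative_advantages(rewards)
    print(name, [round(a, 2) for a in advantages])
```

Under these assumptions the easy and hard groups yield all-zero advantages, so they contribute no gradient, while the mixed group assigns positive advantage to successes and negative advantage to failures.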

The surprising part is that hard samples aren't just neutral; they're actively harmful. When a model almost never solves a problem, group-relative normalization treats the rare accidental success as a very high-advantage trajectory, so the model over-learns whatever produced that lucky win: answer repetition, skipping computation, degenerate shortcuts. Worse, those shortcuts then bleed into capabilities the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). So the 'too hard' side of the curve isn't a missed opportunity; it's a contamination risk.
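A small sketch can show how much weight a lucky win receives. It assumes binary rewards and a mean/standard-deviation normalization within the group, in which case a success in a group with pass rate p gets advantage (1 - p) / sqrt(p(1 - p)) = sqrt((1 - p) / p). The function name and the group sizes are hypothetical choices for illustration.

```python
import math

def success_advantage(group_size, num_successes):
    """Normalized advantage assigned to each successful rollout in a group.

    Assumes 0/1 rewards and (reward - mean) / std normalization.
    Undefined when the group is all successes or all failures.
    """
    if not 0 < num_successes < group_size:
        raise ValueError("need at least one success and one failure")
    p = num_successes / group_size
    sigma = math.sqrt(p * (1 - p))
    return (1 - p) / sigma

# Hypothetical cases: a balanced group versus a single accidental success.
for group_size, num_successes in [(8, 4), (8, 1), (64, 1)]:
    adv = success_advantage(group_size, num_successes)
    print(f"{num_successes}/{group_size} correct -> advantage {adv:.2f}")
```

Under these assumptions the rarer the success, the larger the advantage attached to it, which is one way to read the claim that accidental wins on near-impossible problems get over-learned.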

What makes this more than a tuning tip is that the productive band *moves*. A sample's difficulty isn't a fixed property of the problem; it's the interaction between the problem and the model's current ability. As the model improves, problems that were medium become easy, and the informative band drifts within a few training steps, which makes any static difficulty filter obsolete almost immediately (see: How does model ability change what samples teach?). The same lesson shows up far outside math: in training empathetic agents, researchers found that moderately demanding, well-aligned environments beat maximally challenging ones, because maximum difficulty pushes the model outside the space it can actually explore (see: Do harder training environments always produce better empathetic AI agents?).
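One way to picture a moving band is a filter that is re-applied to pass rates measured under the current model rather than computed once. The sketch below is an illustration only: the 0.2 and 0.8 thresholds, the problem ids, and the pass rates are all made-up assumptions, and the corpus does not prescribe this particular procedure.

```python
def select_informative(pass_rates, low=0.2, high=0.8):
    """Keep problem ids whose current empirical pass rate lies in [low, high].

    pass_rates maps a problem id to the fraction of recent rollouts that the
    current model solved; thresholds are illustrative, not from the source.
    """
    return [pid for pid, rate in pass_rates.items() if low <= rate <= high]

# Hypothetical pass rates for the same three problems at two points in training.
early = {"p1": 0.9, "p2": 0.5, "p3": 0.05}
later = {"p1": 1.0, "p2": 0.95, "p3": 0.4}

print(select_informative(early))  # ['p2']
print(select_informative(later))  # ['p3']
```

In this toy case the problem that was informative early has become easy later, and a previously too-hard problem has entered the band, which is why a filter computed once from the early numbers would select the wrong samples.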

There's a deeper through-line here about exploration. RLVR tends to exploit what already works rather than explore, which can quietly narrow a model's problem-solving range, a failure mode called capability boundary collapse (see: Why does RLVR training narrow a model's problem solving ability?). Hard samples accelerate that collapse by rewarding shortcuts; medium samples keep the model in the region where genuine exploration still pays off. One fix the corpus points to is sequencing: run imitation-style supervised RL first to build reasonable rollouts, *then* apply RLVR, because the imitation phase manufactures the partial successes that make outcome rewards informative in the first place (see: Does sequencing imitation then exploration training improve reasoning?).

If you want to follow the thread further, the corpus also complicates what 'works' even means here: some RLVR gains turn out to be memorization on contaminated benchmarks rather than real reasoning (see: Does RLVR success on math benchmarks reflect genuine reasoning improvement?), and reasoning *activation* can be separated from benchmark improvement entirely (see: Can genuine reasoning activation coexist with contaminated benchmarks?). Most strikingly, even random or spurious rewards can boost some models by surfacing latent behavior from pretraining, which suggests the difficulty curve isn't only about the data, but about what optimization pressure a given model is primed to respond to (see: Why do random rewards improve reasoning for some models but not others?).


Sources (9 notes)

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within a few training steps.

Do harder training environments always produce better empathetic AI agents?

RLVER research shows moderately demanding, well-aligned training environments produce better empathetic agents than maximally challenging configurations. Overly difficult setups push models outside their explorable space, causing instability rather than growth.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math improves by 16-25% on MATH-500 when trained with random or incorrect rewards, because those rewards activate latent code-reasoning behavior from pretraining; Llama and OLMo show no such gains. Pretraining format determines what optimization pressure can surface.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing claims about sample difficulty in RLVR (reinforcement learning from verifiable rewards). The question: does the medium-difficulty sweet spot still hold, or have newer models, methods, and evaluation practices shifted the regime?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as perishable:
• Medium-difficulty samples outperform easy (low variance, no gradient signal) and hard (degenerate shortcuts, over-learning luck) via an inverted-U advantage curve (2024–2025).
• Hard samples actively contaminate capability by rewarding spurious shortcuts and accelerating capability boundary collapse, not merely wasting capacity (2025).
• Sample difficulty is dynamic—it shifts as model ability improves, making static difficulty filters obsolete within training steps (2025).
• Supervised RL (imitation) → RLVR sequencing outperforms either alone because imitation-phase rollouts enable informative outcome rewards (2025).
• RLVR gains on contaminated benchmarks are primarily memorization, separable from reasoning activation; spurious rewards can still improve performance via latent pretraining behaviors (2025–2026).

Anchor papers (verify; mind their dates):
• 2402.05808 (Training via Reverse Curriculum RL, 2024)
• 2504.13837 (Does RL Really Incentivize Reasoning Beyond Base?, 2025)
• 2507.10532 (Memorization vs. Reasoning, Data Contamination, 2025)
• 2605.28388 (Mechanistic Interpretation of Sample Difficulty, 2026)

Your task:
(1) RE-TEST the inverted-U constraint: Have scaling (o1-class reasoning models, test-time compute), novel RLVR methods (outcome filtering, uncertainty-weighted sampling, adaptive curricula), or harnesses (multi-agent RL, long-horizon planning) since relaxed the hard-sample penalty or extended the productive band? Does the difficulty curve still hold for code, creative, or long-horizon tasks, or only math? Separate durable claim (difficulty matters) from perishable one (inverted-U is universal).
(2) Surface the strongest *contradicting* or *superseding* work from late 2025–mid 2026: any papers showing hard-sample benefits, static difficulty filters that don't degrade, or evidence the memorization/reasoning split makes the difficulty framing obsolete?
(3) Propose two research questions that assume the regime may have shifted: (a) If test-time scaling (longer rollouts, verification) dominates, does offline sample difficulty still predict online learning curves? (b) Can adaptive, model-state-aware difficulty assignment (not static) salvage the hard-sample regime?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
