INQUIRING LINE

The 'sweet spot' difficulty for AI training is a moving target: problems that taught it yesterday are useless once the model outgrows them.

How does the optimal difficulty band shift as the model's capabilities improve during training?

This explores how the 'sweet spot' difficulty of training problems doesn't stay fixed — as a model gets better, the problems that teach it the most keep moving, and the corpus has a surprising amount to say about why.

The short answer is that the optimal difficulty band is a moving target, and it drifts faster than most training setups account for. The core insight is that a problem's teaching value isn't a property of the problem at all. It's a property of the *relationship* between the problem's difficulty and what the model can currently do. A sample that was richly informative at step 100 can become useless or even harmful by step 200, because the model has outgrown it (see "How does model ability change what samples teach?").
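One way to make that relationship concrete is to re-estimate each problem's pass rate from the model's most recent rollouts and keep only the problems that currently fall inside a target band. The sketch below is illustrative only: the `select_band` name, the 0.2 to 0.8 band edges, and the toy rollout outcomes are assumptions, not values taken from the underlying notes.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of recent rollouts that solved the problem."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def select_band(
    recent: Dict[str, List[bool]], low: float = 0.2, high: float = 0.8
) -> List[str]:
    """Keep problems whose current pass rate lies inside [low, high].

    The band edges are hypothetical defaults chosen for illustration.
    """
    return sorted(
        pid for pid, outs in recent.items() if low <= pass_rate(outs) <= high
    )


# Hypothetical rollout outcomes for the same three problems at two points in training.
step_100 = {
    "easy": [True, True, True, True],
    "medium": [True, False, True, False],
    "hard": [False, False, False, False],
}
step_200 = {
    "easy": [True, True, True, True],
    "medium": [True, True, True, True],
    "hard": [True, False, False, True],
}

band_100 = select_band(step_100)
band_200 = select_band(step_200)
print("in band at step 100:", band_100)
print("in band at step 200:", band_200)
```

In this toy data the problem labelled "medium" is the only one in the band at step 100, and the problem labelled "hard" is the only one in the band at step 200, even though neither problem changed.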

Where does the band sit at any given moment? The corpus points consistently to the middle. Learning gains follow an inverted-U across difficulty: medium-hard problems teach best because they mix enough successes to give a usable signal with enough failures to be informative, while easy problems have no variance to learn from and brutally hard ones produce almost no successes (see "Why do medium-difficulty problems teach reasoning better than hard ones?"). As the model improves, that medium zone slides upward: yesterday's hard problem becomes today's productive medium, and yesterday's medium becomes too easy to bother with. The implication is that static difficulty labels go stale within steps, so any curriculum that fixes difficulty up front is calibrating to a model that no longer exists.
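A simple way to see the inverted-U, assuming a binary correct/incorrect reward: the variance of that reward at pass rate p is p(1 - p), which is zero at both extremes and peaks at p = 0.5. This is a standard property of a Bernoulli variable, used here as a rough proxy for available learning signal rather than as the measure used in the cited work.

```python
def reward_variance(p: float) -> float:
    """Variance of a Bernoulli (0/1) reward with success probability p."""
    return p * (1.0 - p)


# Illustrative pass rates, from never solved (0.0) to always solved (1.0).
pass_rates = [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
variances = [reward_variance(p) for p in pass_rates]

for p, v in zip(pass_rates, variances):
    print(f"pass rate {p:.2f} -> reward variance {v:.4f}")
```

The proxy is symmetric around 0.5, so it does not capture the additional harm the notes attribute to very hard problems; it only shows why both extremes carry little signal.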

The stakes for getting this wrong are higher than just wasted compute. Feeding a model problems that sit *above* its current band doesn't just fail to help; it actively damages capabilities the model already had. On near-impossible problems, rare accidental successes get treated as high-value trajectories by group-relative advantage normalization, which reinforces degenerate shortcuts like answer-repetition and skipped computation, and those shortcuts then contaminate previously sound reasoning (see "Do overly hard RLVR samples actually harm model capabilities?"). So the upper edge of the band isn't a soft ceiling you can safely overshoot; crossing it is corrosive. The same logic shows up in knowledge distillation: teacher-refined data that exceeds the student's current learning frontier degrades performance even when it's objectively higher quality, so students should filter refinements against their own ability rather than chase the best available signal (see "Does teacher-refined data always improve student model performance?").
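To illustrate the mechanism, here is a minimal sketch of group-relative advantage normalization in the common mean-and-standard-deviation form. That form is an assumption, since the cited work may use a variant, and the group size of 8 and the `eps` term are illustrative choices.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and population std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical groups of 8 rollouts with binary rewards.
near_impossible = [1.0] + [0.0] * 7   # one accidental success
medium = [1.0] * 4 + [0.0] * 4        # evenly split

adv_hard = group_relative_advantages(near_impossible)
adv_medium = group_relative_advantages(medium)

print(f"advantage of the lone success: {adv_hard[0]:.2f}")
print(f"advantage of a success in the even group: {adv_medium[0]:.2f}")
```

With one success in eight, the lone successful rollout receives an advantage of about 2.65, versus about 1.0 per success in the evenly split group. Whatever that one rollout did, sound reasoning or a shortcut, is reinforced strongly.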

The more interesting wrinkle is that the band may not even be one-dimensional. Training tends to move through phases, which means *which kind* of difficulty matters shifts too. Across eight models, RL training consistently ran through a first phase where execution correctness is the bottleneck and a second phase where strategic planning becomes the limiting skill, so the right kind of challenge early (get the steps right) differs from the right kind later (plan better) (see "Does RL training follow a predictable two-phase learning sequence?"). There's also a curriculum angle: dense, step-wise expert-similarity rewards can keep small models learning on problems that would otherwise be all-failure, effectively widening the bottom of the usable band before handing off to sparse outcome rewards once the model is strong enough to succeed on its own (see "Can step-wise expert rewards help small models learn hard reasoning?").
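The contrast between dense step-wise rewards and sparse outcome rewards can be sketched as follows. Everything here is a hypothetical illustration: the use of `difflib.SequenceMatcher` as the similarity measure, the function names, the 0.2 handoff threshold, and the toy steps are assumptions, not the method of the cited work.

```python
from difflib import SequenceMatcher
from typing import List


def stepwise_reward(model_steps: List[str], expert_steps: List[str]) -> float:
    """Mean per-step string similarity to the expert trajectory (dense signal)."""
    if not expert_steps:
        return 0.0
    scores = []
    for i, expert in enumerate(expert_steps):
        # A missing model step is compared as an empty string and scores 0.
        model = model_steps[i] if i < len(model_steps) else ""
        scores.append(SequenceMatcher(None, model, expert).ratio())
    return sum(scores) / len(scores)


def outcome_reward(final_answer: str, reference: str) -> float:
    """1.0 only if the final answer matches the reference (sparse signal)."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def choose_reward(current_pass_rate: float, handoff: float = 0.2) -> str:
    """Hypothetical handoff rule: switch to outcome rewards once successes appear."""
    return "outcome" if current_pass_rate >= handoff else "stepwise"


# Toy rollout: the first step matches the expert, the final answer is wrong.
expert = ["expand (x+1)^2", "collect like terms"]
model = ["expand (x+1)^2", "guess"]

dense = stepwise_reward(model, expert)
sparse = outcome_reward("x^2 + 1", "x^2 + 2x + 1")
print(f"dense reward: {dense:.2f}, sparse reward: {sparse:.1f}")
print("reward used at pass rate 0.0:", choose_reward(0.0))
```

The failed rollout earns nothing under the outcome reward but still earns partial credit under the step-wise reward, which is the sense in which dense rewards keep an all-failure problem usable.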

The quiet thread running under all of this: chasing the difficulty band aggressively costs you plasticity. Models that drift far from their base distribution lose the ability to keep learning new tasks, while staying close to base preserves the room to adapt later (see "Does staying close to the base model preserve learning ability?"). So the real design problem isn't just tracking the band as it rises; it's tracking it without burning the model's capacity to keep climbing.
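Drift from the base distribution is commonly quantified with KL divergence. The sketch below computes it for a single toy next-token distribution. The probabilities are invented for illustration, and the direction KL(policy || base) is an assumption, since the notes do not say which direction is measured.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token distributions over a three-token vocabulary.
base = [0.5, 0.3, 0.2]
close_policy = [0.55, 0.28, 0.17]   # small shift away from base
far_policy = [0.9, 0.05, 0.05]      # large shift away from base

kl_close = kl_divergence(close_policy, base)
kl_far = kl_divergence(far_policy, base)
print(f"KL(close || base) = {kl_close:.4f} nats")
print(f"KL(far   || base) = {kl_far:.4f} nats")
```

A training loop that tracks this quantity alongside the difficulty band would have a direct readout of how much drift each curriculum step is costing.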

Sources 7 notes

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about how optimal training difficulty shifts as model capability grows. The question remains open: does the difficulty band truly move as predicted, and what forces now govern its motion?

What a curated library found — and when (dated claims, not current truth):
Findings span May 2024–May 2026. A library of curated arXiv work identified:
• Sample informativeness is dynamic: a problem's teaching value depends on the interaction between task difficulty and current model capability; medium-hard problems yield the best learning signal (inverted-U effect) (~2026).
• As models improve, the optimal difficulty band drifts upward; static difficulty labels become stale within training steps (~2026).
• Overly-hard samples actively damage prior capabilities via group-relative reward normalization, reinforcing shortcuts like answer-repetition and skipped computation (~2026).
• Teacher-refined instruction data exceeding student capability degrades performance; students should filter refinements against their own learning frontier (~2025).
• RL training exhibits two phases: early bottleneck is execution correctness, later bottleneck is strategic planning; the *kind* of difficulty that helps shifts between phases (~2026).
• Dense step-wise expert-similarity rewards widen the usable band for small models before handing off to sparse outcome rewards (~2025).
• Lower KL drift from base distribution preserves plasticity and enables stronger continual adaptation (~2026).

Anchor papers (verify; mind their dates):
- arXiv:2605.28388 (May 2026): Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- arXiv:2510.25992 (Oct 2025): Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- arXiv:2605.12484 (May 2026): Learning, Fast and Slow: Towards LLMs That Adapt Continually
- arXiv:2507.14783 (Jul 2025): Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Your task:
(1) RE-TEST THE BAND HYPOTHESIS. For each constraint above — the inverted-U effect, upward drift, degenerate-behavior risk, two-phase dynamics, plasticity trade-off — search for newer models, curriculum methods, dynamic sampling strategies (e.g., online importance weighting, adaptive batch selection), or orchestration patterns (memory, multi-agent decomposition, hierarchical RL) that may have RELAXED or OVERTURNED these findings. Separate the durable insight (difficulty tracking matters) from the perishable claim (the band moves as a single unit, or drift above the band is always harmful). Cite what resolved each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper show static difficulty works fine, or that models *benefit* from impossible-hard samples under certain reward structures, or that KL constraints don't preserve plasticity?
(3) Propose 2 research questions that assume the regime may have shifted:
  – How do modern adaptive sampling algorithms (e.g., loss-weighted, uncertainty-driven, or learned curricula) eliminate the stale-label problem? Do they measure band drift?
  – Under multi-task or open-ended RL, does the band concept break down? Is there a task-specific band, or one unified band across domains?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
