INQUIRING LINE

Does the productive difficulty band ever stabilize during training?

This explores whether the 'sweet spot' of medium-difficulty problems that teach a model best stays fixed during training, or whether it keeps moving as the model improves.


The short version from the corpus: the band of problems that actually teach a model (not too easy, not too hard) does not sit still. It drifts as the model improves, and that drift is the whole point.

The clearest answer is that a sample's teaching value isn't a property of the sample alone: it's a property of the *gap* between the problem's difficulty and the model's current ability. As the model gets better, problems that were once in the productive zone become trivial and stop providing a learning signal, so the band slides toward harder material. One note puts it bluntly: static difficulty estimates go obsolete within steps, because the medium-difficulty zone is a moving target (see "How does model ability change what samples teach?"). This is also why there is a productive band in the first place. Learning follows an inverted-U across difficulty: medium problems win because they balance enough successes with informative failures, while easy ones lack variance and hard ones amplify shortcuts (see "Why do medium-difficulty problems teach reasoning better than hard ones?").
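
The moving target can be made concrete with a small sketch. It assumes binary (pass/fail) rewards, for which the reward variance at pass rate p is p(1 - p): largest at p = 0.5 and zero at p = 0 or p = 1. The logistic `pass_rate` model, the `ability` and `difficulty` scales, and the 0.2 to 0.8 band thresholds are illustrative assumptions, not something the notes specify.

```python
import math

def pass_rate(ability: float, difficulty: float) -> float:
    """Hypothetical logistic model of the chance the model solves a problem."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def reward_variance(p: float) -> float:
    """Variance of a binary reward with success probability p."""
    return p * (1.0 - p)

def productive_band(ability, difficulties, low=0.2, high=0.8):
    """Keep problems whose *current* pass rate lies inside [low, high]."""
    return [d for d in difficulties if low <= pass_rate(ability, d) <= high]

difficulties = [0, 1, 2, 3, 4, 5, 6]

# Same problem pool, two points in training: the band slides toward harder items.
early = productive_band(ability=1.0, difficulties=difficulties)
late = productive_band(ability=4.0, difficulties=difficulties)
print(early, late)
```

Under these assumptions the selected set moves from the easier problems to the harder ones as `ability` grows, even though no problem's `difficulty` value changed. A difficulty label fixed at the start of training would keep pointing at the early set.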

What happens if you ignore the drift and keep feeding problems that sit *outside* the band, specifically on the too-hard side? The model doesn't just fail to learn; it actively degrades. Near-impossible problems push it toward degenerate shortcuts such as answer repetition and skipped computation. Because group-relative normalization treats rare accidental successes as high-advantage trajectories, those shortcuts get amplified and contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So a band that's allowed to go stale isn't neutral: it's harmful.
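
The amplification effect is easy to see numerically. The sketch below assumes that "group-relative normalization" means subtracting the group mean and dividing by the group standard deviation over a set of rollouts for one prompt. That reading, the `eps` term, and the group sizes are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (assumed scheme)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Medium problem: half of 8 rollouts succeed.
medium = group_relative_advantages([1, 0, 1, 0, 1, 0, 1, 0])

# Near-impossible problem: 1 accidental success out of 64 rollouts.
hard = group_relative_advantages([1] + [0] * 63)

print(round(medium[0], 2), round(hard[0], 2))
```

With binary rewards and success rate p, a success receives an advantage of roughly sqrt((1 - p) / p), so the rarer the success, the larger the update it earns. A lucky shortcut on a near-impossible problem is therefore weighted far more heavily than a routine success on a medium one.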

Here's the more interesting twist: even though the difficulty band moves, training itself has a stable *shape*. RL training across eight models shows a consistent two-phase arc: first execution correctness drives gains, then strategic planning becomes the bottleneck, with planning-token entropy rising while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). SFT-then-RL shows its own predictable three-phase progression of disruption, readaptation, then overfitting (see "Why does SFT-then-RL training follow a predictable three-phase pattern?"). So the band of useful difficulty keeps moving, but it moves through phases you can anticipate, which is exactly what makes curriculum approaches viable rather than hopeless.

That reframes the practical question from 'where is the stable band?' to 'how do you track a moving one?' The corpus answers in two ways. Curricula deliberately ride the drift: dense step-wise expert-similarity rewards give signal even when every rollout fails, and they work best as a foundation *before* outcome-based refinement takes over (see "Can step-wise expert rewards help small models learn hard reasoning?"). Critique models, meanwhile, fight the band's tendency to collapse: by maintaining solution diversity across self-training iterations, they prevent premature convergence, which is a more fundamental win than test-time accuracy (see "Do critique models improve diversity during training itself?"). The thing you didn't know you wanted to know: stability in training isn't found in a fixed difficulty level. It's found in the predictable *pattern* of how that level has to keep moving.
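
To illustrate why a step-wise reward still carries signal when every rollout fails, here is a minimal sketch. The notes say only that the reward measures alignment with expert actions at each step. Using `difflib.SequenceMatcher` as the similarity measure, comparing steps by position, and the example strings are all assumptions.

```python
from difflib import SequenceMatcher

def stepwise_rewards(model_steps, expert_steps):
    """Return one similarity score in [0, 1] per expert step.

    Missing model steps are scored against an empty string.
    """
    rewards = []
    for i, expert in enumerate(expert_steps):
        model = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, model, expert).ratio())
    return rewards

expert = ["expand (x+1)^2", "collect terms", "solve for x"]

# This rollout ends with a wrong final step, so an outcome-only reward is 0.
rollout = ["expand (x+1)^2", "collect terms", "guess x = 3"]

rewards = stepwise_rewards(rollout, expert)
print(rewards)
```

An outcome-only reward would score this rollout the same as one that got nothing right. The per-step scores instead credit the two correct steps and penalize only the third, which is the kind of dense signal that lets training start on problems currently above the productive band.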


Sources (7 notes)

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI research analyst. The question remains open: **Does the productive difficulty band ever stabilize during training, or does it necessarily keep drifting?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as time-stamped constraints, not permanent law.
- Sample informativeness is dynamic, not static: a problem's teaching value depends on the gap between its difficulty and the model's *current* ability; medium-difficulty problems yield the inverted-U learning curve, while easy ones lack signal variance and hard ones induce degenerate shortcuts (~2026).
- Overly-hard RL samples push models toward shortcuts (answer repetition, skipped computation) that contaminate already-learned skills, via group-relative normalization amplifying rare accidental successes (~2026).
- RL training exhibits a predictable two-phase arc: first execution correctness, then strategic planning becomes the bottleneck, with planning-token entropy rising while execution entropy stabilizes (~2025).
- SFT-then-RL shows a three-phase progression: disruption, readaptation, then overfitting (~2025).
- Curricula riding the drift (step-wise expert-similarity rewards) enable learning even when all rollouts fail, most effective *before* outcome-based refinement (~2025); critique models maintain solution diversity, preventing premature convergence (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2605.28388 (2026) — Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- arXiv:2510.25992 (2025) — Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- arXiv:2411.16579 (2024) — Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
- arXiv:2504.13837 (2025) — Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base

Your task:
(1) **RE-TEST THE STABILITY CLAIM.** For each constraint above, ask: Have newer model scales, adaptive sampling methods (e.g., dynamic curriculum learners, online difficulty estimation), or multi-agent + memory orchestration *resolved* the drift? Is the two-phase/three-phase shape still predictable, or do newer training regimes (e.g., extremely long-horizon RL, mixture-of-experts RL) break that pattern? Where does the moving-band still hold, and what *would* stabilize it?
(2) **Surface contradictions.** Identify work from the last ~6 months claiming difficulty bands *do* stabilize (e.g., via saturation, plateau, or a learned internal curriculum). Reconcile it with the drift picture.
(3) **Propose two forward questions:** (a) Can you *deliberately design* a training regime where the productive band stabilizes — not by fixing difficulty, but by co-adapting problem and model in a coupled way? (b) If the band drifts through predictable phases, can you *forecast* when planning vs. execution will bottleneck, and schedule sample difficulty in advance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
