INQUIRING LINE

What features does a sample reinforce when it moves bands?

This reads 'bands' as the difficulty bands from curriculum-style training — where a problem that was once too hard (or too easy) drifts into the productive zone as the model's ability changes — and asks: when a sample crosses into that zone, what is it actually teaching the model?


This explores a real moving target in training: a sample's value isn't fixed by its difficulty but by how that difficulty meets the model's *current* ability, so the question is what a sample reinforces once it slides into the productive band. The cleanest anchor is the finding that sample informativeness is dynamic: the band of medium-difficulty problems that teach the most drifts during training, making any static difficulty label stale within steps (see "How does model ability change what samples teach?"). So 'moving bands' isn't the sample changing; it's the model moving underneath it. The interesting twist is that what gets reinforced may not be a new skill at all but a *format* or *distribution* the model already had latent.
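To make the "model moving underneath the sample" idea concrete, here is a minimal sketch. It assumes a logistic (Rasch-style) link between ability and difficulty and a productive band of 20% to 80% pass probability. Neither assumption comes from the cited notes, and the names `pass_probability` and `band_membership` are hypothetical.

```python
import math


def pass_probability(ability: float, difficulty: float) -> float:
    """Assumed logistic link: chance the model solves a sample of this difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def band_membership(difficulties, ability, low=0.2, high=0.8):
    """Flag which samples sit in the (assumed) productive band at this ability."""
    return [low <= pass_probability(ability, d) <= high for d in difficulties]


# Fixed difficulty labels; only the model's ability changes between calls.
difficulties = [-2.0, 0.0, 2.0, 4.0]
early = band_membership(difficulties, ability=0.0)
late = band_membership(difficulties, ability=3.0)
print(early)  # [False, True, False, False]
print(late)   # [False, False, True, True]
```

The difficulty labels never change, yet the set of in-band samples does, which is why a band computed once at the start of training goes stale.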

That's where the corpus gets surprising. When reinforcement learning runs, it doesn't broadly expand capability; it converges hard on one dominant output format inherited from pretraining and suppresses the alternatives, often within the first epoch, and the winning format tracks model scale rather than actual performance (see "Does RL training collapse format diversity in pretrained models?"). Read alongside the dynamic-band finding, this suggests a sample entering the productive band frequently reinforces *presentation and distributional habits* (a way of laying out a solution) more than it installs genuinely novel reasoning. The band shift is amplifying something already present.
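One way to watch for this kind of collapse is to track the entropy of the output-format distribution across checkpoints. The sketch below is illustrative only: it assumes each sampled output has already been tagged with a format label, and the labels (`"prose"`, `"code"`, `"latex"`) and sample counts are made up rather than taken from the cited experiments.

```python
import math
from collections import Counter


def format_entropy(format_labels):
    """Shannon entropy (bits) of the empirical format distribution."""
    counts = Counter(format_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical format tags for outputs sampled before and after RL.
before = ["prose", "code", "latex", "prose", "code", "latex", "prose", "code"]
after = ["code"] * 7 + ["prose"]

h_before = format_entropy(before)
h_after = format_entropy(after)
print(h_before > h_after)  # True: the distribution has narrowed
```

A sharp entropy drop early in training, with no matching accuracy gain, would be consistent with format amplification rather than new skill. It would not prove it.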

The SFT-then-RL trajectory makes the 'what' more concrete by showing it has phases. When expert data diverges from the model's policy, training moves through shift → readapt → overfit: first the new samples disrupt existing capability, then the model readapts toward the expert patterns, then it overfits to them (see "Why does SFT-then-RL training follow a predictable three-phase pattern?"). So the same sample reinforces different things depending on *when* it lands relative to the model's state: disruption early, pattern-matching in the middle, memorization late. 'Moving bands' and 'moving phases' are two views of the same dependency on current ability.

There's a quiet warning hiding here too. A model can hold all the linearly decodable features a task needs while its internal organization is fractured: perfect on the metric, brittle under perturbation (see "Can models be smart without organized internal structure?"). That means when a sample 'reinforces a feature,' the surface signal (accuracy went up) can mask whether it reinforced robust structure or just a decodable shortcut. Pair that with the discovery that some override tasks need the model to *compose* conflicting cues rather than filter them (see "Why does removing spurious cues sometimes hurt model performance?"), and the honest answer to 'what features' becomes: not necessarily the ones you intended.

The sharp takeaway you didn't ask for: the productive band isn't a property of your dataset — it's a property of the model at a given step, and what samples reinforce inside it skews toward amplifying existing formats and distributions over teaching new skills. If you're curating by difficulty, you're aiming at a target that has already moved.


Sources (5 notes)

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a training dynamics researcher re-testing whether claimed constraints on sample reinforcement still hold under 2025–2026 models and methods.

The precise question: When a sample's difficulty drifts into a model's productive band during training, what features or habits does it actually reinforce — and does the regime still favor format amplification over genuine skill acquisition?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Mar 2026 across curriculum learning, RL post-training, and mechanistic interpretability.

• Sample informativeness is dynamic: the band of medium-difficulty problems that teach most shifts during training, making static difficulty labels stale within steps (~2024–2025).
• RL post-training converges hard on one dominant pretraining-inherited output format and suppresses alternatives within ~1 epoch; the winning format tracks model scale rather than actual performance (~2025).
• SFT-then-RL exhibits a shift→readapt→overfit progression: new samples first disrupt capability, then the model readapts to expert patterns, then overfits (~2025).
• Perfect performance metrics can mask fractured internal organization and brittle representations; heuristic override is structurally distinct from true shortcut learning (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2405.05904 (2024-05): Fine-tuning and hallucination dynamics.
• arXiv:2504.07912 (2025-04): Echo Chamber — RL amplifies pretraining behaviors.
• arXiv:2508.11408 (2025-08): On-policy RL meets off-policy experts.
• arXiv:2603.29025 (2026-03): Surface heuristics override implicit constraints.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the format-amplification claim and the dynamic-band finding, probe whether newer models (GPT-4o, o1, Claude 4), scaling laws, mixture-of-experts routing, advanced curricula, or preference-optimization methods (e.g., DPO, IPO variants) have relaxed the "single dominant format" bottleneck or made the productive band more stable/predictable. Separate the durable claim (productive band is model-relative, not dataset-relative) from what may now be solvable (e.g., can multi-format pretraining or continual adaptation keep multiple formats alive through RL?). Cite concretely.
(2) Surface the strongest disagreement: does any recent work (last 6 months) argue that RL post-training can install *genuinely novel* reasoning structure, contradicting the format-amplification thesis? If so, what method or evidence?
(3) Propose two new questions: one assuming the regime has moved toward multi-format stability; one assuming brittle-representation risk has become *more* acute as model scale and RL intensity grow.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
