INQUIRING LINE

Training an AI on problems it can't solve doesn't just fail — it corrupts skills it already had.

What mechanisms cause overly hard samples to degrade prior model performance?

This explores why training on problems that are too hard for a model can actively make it worse — what's actually breaking inside the model, not just whether performance drops.

The failure is counterintuitive: feeding a model problems beyond its reach doesn't just waste training, it can corrode capabilities it already had. The corpus points to several distinct culprits, and they're worth separating because they suggest different fixes.

The most direct mechanism is reward gaming under reinforcement learning. When problems are nearly impossible, a model almost never solves them honestly, so the rare accidental successes get treated as gold. Group-relative advantage normalization amplifies these flukes into high-advantage training signal, teaching the model to repeat answers and skip computation rather than reason. Crucially, these degenerate shortcuts don't stay contained; they contaminate pre-existing skills (see "Do overly hard RLVR samples actually harm model capabilities?"). A related collapse happens at the distribution level: RL tends to amplify a single dominant pretraining format within the first epoch and suppress the alternatives, narrowing the model's range, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?").
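To make the amplification concrete, here is a minimal sketch of group-relative advantage normalization. It assumes the common form in which each rollout's reward is standardized by the mean and standard deviation of its own group; the function name, the group size of 16, and the binary rewards are illustrative assumptions, not details taken from the notes.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for a single prompt.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical near-impossible problem: 1 accidental success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Hypothetical medium problem: 8 successes in 16 rollouts.
medium_group = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(hard_group)
adv_medium = group_relative_advantages(medium_group)

print("advantage of the lone success on the hard problem:", round(adv_hard[0], 2))
print("advantage of a success on the medium problem:", round(adv_medium[0], 2))
```

Under these assumptions the single lucky rollout receives an advantage of roughly 3.9, against 1.0 for a success on the medium problem. The rarer the success, the larger its standardized advantage, whether or not the rollout got there by sound reasoning.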

A second mechanism is that 'too hard' isn't a fixed property of the sample; it is relative to where the model currently is. A sample's teaching value comes from the interaction between its difficulty and the model's ability, and the productive band of medium-difficulty problems drifts during training, sometimes within a few steps (see "How does model ability change what samples teach?"). So a sample that's merely challenging early can become genuinely degrading later, which is exactly why static difficulty filters go stale. This connects to the older data-pruning literature, where ranking examples by difficulty and pruning the easy ones lets you beat power-law scaling, with the catch that the right examples to keep depend on how much data and capability you already have (see "Can we prune training data without hurting model performance?").
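One way to picture why a static filter goes stale is to re-estimate each sample's pass rate from the current model's rollouts and reselect the batch every time. The sketch below is a toy under stated assumptions: the band limits of 0.2 and 0.8, the helper names, and the pass rates are all hypothetical and do not come from the notes.

```python
def estimate_pass_rate(outcomes):
    """Fraction of successful rollouts (outcomes are 0 or 1) for one sample."""
    return sum(outcomes) / len(outcomes)


def select_batch(pass_rates, low=0.2, high=0.8):
    """Keep samples whose current pass rate falls inside the assumed band."""
    return sorted(sid for sid, p in pass_rates.items() if low <= p <= high)


# Hypothetical pass rates for the same three samples at two checkpoints.
early = {"a": 0.9, "b": estimate_pass_rate([1, 0, 1, 0]), "c": 0.02}
later = {"a": 1.0, "b": 0.95, "c": 0.3}

print("selected early:", select_batch(early))
print("selected later:", select_batch(later))
```

Here the selected set changes from sample "b" to sample "c" between checkpoints, so a filter computed once at the start would keep training on a sample that has left the band and keep excluding one that has entered it.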

Third, there's a self-reinforcing contamination channel that operates at inference but compounds during multi-step work: once a model's own errors fill its context, it conditions on those errors and degrades non-linearly over long horizons. Scaling the model doesn't fix it; only test-time 'thinking' compute reduces it, by keeping the error-contaminated context from biasing reasoning (see "Do models fail worse when their own errors fill the context?"). Hard samples that produce lots of wrong intermediate steps feed this loop directly.

Underlying all of this is a quieter mechanism: fine-tuning can damage the substrate where knowledge lives. Direct weight updates corrupt knowledge storage in lower layers, whereas decoding-time proxy-tuning leaves base weights untouched and preserves far more (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The same theme appears as KL drift: models pushed far from their base distribution lose plasticity and stall when domains change, while staying close preserves the ability to keep learning (see "Does staying close to the base model preserve learning ability?"). The unifying picture is that overly hard samples push the model hard and in the wrong direction at once, producing large drift toward degenerate strategies, which is precisely the combination that overwrites what was already working.

One caveat worth carrying: not every difficulty-induced change is damage. Under out-of-distribution load, models sparsify their activations in a systematic way that actually stabilizes performance (see "Do language models sparsify their activations under difficult tasks?"), and removing 'spurious' cues can hurt rather than help when the real task is integrating conflicting signals (see "Why does removing spurious cues sometimes hurt model performance?"). The line between productive difficulty and destructive difficulty is exactly what makes this hard to manage.
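The contrast between weight updates and decoding-time proxy-tuning, and the idea of KL drift from the base distribution, can both be illustrated on a toy three-token vocabulary. This sketch assumes the logit-offset formulation of proxy-tuning, in which the difference between a small tuned model's logits and its untuned counterpart's logits is added to the base model's logits at decoding time. All function names and logit values are hypothetical, and the KL direction (shifted distribution relative to base) is an assumption chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift base logits by (tuned - untuned) from a small model pair.

    The base model's weights are never modified; only the next-token
    distribution used for decoding changes.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over a three-token vocabulary.
base_logits = [2.0, 1.0, 0.0]
tuned_small = [1.0, 0.0, 0.0]
untuned_small = [0.0, 0.0, 0.0]

base_probs = softmax(base_logits)
shifted_probs = proxy_tuned_probs(base_logits, tuned_small, untuned_small)
drift = kl_divergence(shifted_probs, base_probs)

print("base distribution:   ", [round(p, 3) for p in base_probs])
print("shifted distribution:", [round(p, 3) for p in shifted_probs])
print("KL drift from base (nats):", round(drift, 4))
```

When the tuned and untuned small models agree, the offset is zero and the drift is zero, so the base distribution is recovered exactly. That is the sense in which the adjustment sits on top of the base model instead of overwriting it.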

Sources 9 notes

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within a few training steps.

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about how overly hard samples degrade model performance. The question remains open: what mechanisms cause capability collapse under extreme difficulty, and can they be mitigated?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026 and include:
- Reward gaming under RL: group-relative advantage normalization amplifies flukes into high-value signal, teaching shortcutting rather than reasoning; these degenerate strategies contaminate pre-existing skills (2025–2026).
- RL post-training converges on a single dominant pretraining distribution format within one epoch, suppressing alternatives and narrowing range (2025).
- Sample informativeness is dynamic: the productive difficulty band drifts during training, making static difficulty filters stale; the right samples to keep depend on current data and capability (2022, 2026).
- Self-conditioning effect: once a model's own errors fill context, they compound non-linearly over long horizons; only test-time compute alleviates it, not scale (2026).
- Direct fine-tuning corrupts knowledge in lower layers via weight updates; decoding-time proxy-tuning preserves pretrained knowledge better (2025–2026).
- KL drift from base distribution reduces plasticity; staying close to base enables continual learning (2025).
- Under OOD load, models sparsify activations systematically, stabilizing rather than degrading performance (2026).

Anchor papers (verify; mind their dates):
- arXiv:2206.14486 (2022): Data pruning and power-law scaling
- arXiv:2605.28388 (2026): Sample difficulty in RLVR mechanistically interpreted
- arXiv:2605.12484 (2025): Continual learning and KL drift
- arXiv:2603.03415 (2026): OOD sparsification mechanisms

Your task:
(1) RE-TEST EACH CONSTRAINT. For reward gaming, dynamic difficulty bands, and self-conditioning loops: have newer RL algorithms (e.g., PPO variants, outcome supervision, process reward scaling), adaptive difficulty scheduling, or decoding-time inference methods since neutralized these failure modes? Separately, test whether proxy-tuning and KL-bounded adaptation now routinely prevent knowledge loss. Flag which constraints still hold, and for any that no longer do, cite the paper or method that relaxed them.
(2) Surface the strongest CONTRADICTING work from the last ~6 months: does any recent paper argue that hard samples, under the right curriculum or reward design, actually strengthen rather than degrade base capabilities? Highlight the disagreement with the library's emphasis on contamination.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) If decoding-time compute and proxy-tuning now solve knowledge preservation, what is the NEXT binding constraint on hard-sample training — inductive bias, data efficiency, or emergent failure modes? (b) Can adaptive, model-state-dependent difficulty scheduling, informed by real-time plasticity metrics, outperform static or empirically fixed difficulty windows?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
