INQUIRING LINE


The examples an AI already gets right teach it almost nothing — the hard ones at the edge of failure do all the real learning.

Why do easy training examples contribute less to model generalization than hard ones?

This explores why ranking training data by difficulty — and trimming the easy, redundant examples — can improve how well a model generalizes, and where that logic breaks down.

The corpus has a sharp empirical answer plus an important set of caveats. The clearest result comes from work on difficulty-based data pruning (see: Can we prune training data without hurting model performance?). When you rank examples by metrics such as how often the model forgets them or how early it gets them right, the easy ones turn out to be largely redundant. The model already 'knows' them after seeing a few, so each additional easy example adds almost nothing. Hard examples sit near the decision boundary and keep delivering new signal. Strikingly, pruning away the easy half of CIFAR-10 didn't just preserve accuracy; the approach also beat the usual power-law scaling curve, meaning the *informative* examples were doing nearly all the work.
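To make the ranking step concrete, here is a minimal sketch of one of the metrics mentioned above: counting how often each example is forgotten during training, then keeping the most-forgotten fraction. The function names, the toy correctness history, and the 50% keep fraction are all illustrative assumptions, not the pruning paper's actual code, and the sketch ignores refinements such as special handling for examples that are never learned.

```python
import numpy as np

def forgetting_counts(correct_history):
    """Count forgetting events per example.

    correct_history: bool array of shape (n_epochs, n_examples); entry [t, i]
    is True if example i was classified correctly at epoch t.
    A forgetting event is a correct -> incorrect transition between
    consecutive epochs.
    """
    h = np.asarray(correct_history, dtype=bool)
    forgot = h[:-1] & ~h[1:]  # correct at epoch t, wrong at epoch t + 1
    return forgot.sum(axis=0)

def keep_hardest(scores, keep_fraction):
    """Return sorted indices of the highest-scoring (hardest) examples."""
    scores = np.asarray(scores)
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    order = np.argsort(-scores, kind="stable")  # hardest first
    return np.sort(order[:n_keep])

# Hypothetical history: 4 epochs (rows) x 4 examples (columns).
history = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
], dtype=bool)

counts = forgetting_counts(history)          # example 0 is never forgotten
kept = keep_hardest(counts, keep_fraction=0.5)
print(counts.tolist(), kept.tolist())
```

In this toy history, the example that is always correct scores zero and is pruned first, which is the redundancy argument in miniature.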

But the corpus is emphatic that 'harder is better' is not monotonic: it inverts at the extreme. Training on nearly-impossible problems actively damages a model (see: Do overly hard RLVR samples actually harm model capabilities?). When a problem is beyond reach, the rare accidental success gets treated as a high-value trajectory, and the model learns to repeat answers or skip computation rather than reason. So difficulty helps only inside a band: too easy is redundant, too hard teaches shortcuts. The useful examples are the ones at the frontier of what the model can currently almost do.
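The source note attributes this failure to group-relative normalization. A small numeric sketch shows why a lone lucky success stands out so strongly. It assumes a GRPO-style rule (reward minus group mean, divided by group standard deviation) with binary rewards and eight rollouts per problem; the exact formula, group size, and function name are assumptions for illustration rather than details taken from the note.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same problem.

    Assumed rule: (reward - group mean) / (group std + eps).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical groups of 8 rollouts with binary (verifiable) rewards.
near_impossible = [0, 0, 0, 0, 0, 0, 0, 1]  # one accidental success
frontier = [1, 0, 1, 0, 1, 0, 1, 0]         # roughly half succeed

adv_hard = group_relative_advantages(near_impossible)
adv_frontier = group_relative_advantages(frontier)

print("lone success advantage:", round(float(adv_hard[-1]), 2))
print("frontier success advantage:", round(float(adv_frontier[0]), 2))
```

Under these assumptions the single success on the near-impossible problem receives an advantage more than twice as large as any success in the balanced group, so whatever produced it, including a guess or a skipped computation, is reinforced heavily.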

That 'almost do' framing recurs from a different angle in work on teacher-refined data (see: Does teacher-refined data always improve student model performance?). Even objectively higher-quality training data degrades a student model when it exceeds the student's learning frontier: difficulty is relative to the learner, not absolute. The same theme appears with richer teacher traces (see: Does richer teacher context hurt student generalization?), where confident, polished examples improve in-domain scores but strip out the uncertainty that helps a model generalize to harder, out-of-distribution cases. Easy or over-smoothed examples can make a model look better while teaching it less.

There's also a deeper reason easy examples generalize poorly: models tend to solve them by surface pattern-matching rather than by learning the underlying principle. Work on argument-quality assessment shows that fine-tuning on labeled examples alone teaches surface cues, not transferable criteria; the model only generalizes once it's given the actual framework (see: Can models learn argument quality from labeled examples alone?). Easy examples are exactly the ones a shortcut can solve, so they reward shortcut-finding instead of forcing the model to build a real rule.

The thing worth taking away: 'hard' isn't a fixed property of an example; it's a relationship between the example and what the model can already do. The examples that generalize are the ones at the edge of the model's current ability, which is why a curriculum that's well-matched to the learner (see: How do domain training techniques actually reshape model behavior?) often beats simply piling on more data.
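One way to read 'difficulty as a relationship' is as a filter on the current model's own pass rate per example. The sketch below keeps only examples the model sometimes solves and sometimes fails. The 0.2 and 0.8 thresholds, the example ids, and the pass rates are hypothetical choices for illustration; none of the notes prescribes specific values.

```python
def select_frontier(pass_rates, low=0.2, high=0.8):
    """Return ids whose pass rate for the current model lies in [low, high].

    pass_rates: dict mapping example id -> fraction of sampled attempts
    the current model solved. Thresholds are illustrative assumptions.
    """
    return [ex for ex, p in pass_rates.items() if low <= p <= high]

# Hypothetical pass rates for the same five examples under two learners.
weak_model = {"a": 1.0, "b": 0.6, "c": 0.25, "d": 0.0, "e": 0.9}
strong_model = {"a": 1.0, "b": 1.0, "c": 0.95, "d": 0.3, "e": 1.0}

selected_weak = select_frontier(weak_model)
selected_strong = select_frontier(strong_model)
print(selected_weak, selected_strong)
```

The same example ("d") is out of reach for one learner and squarely on the frontier for the other, so the selected set has to be recomputed as the model improves.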

Sources 6 notes

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.


How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining why easy training examples contribute less to model generalization than hard ones. The question remains open; treat the findings below as dated claims to be re-tested, not current truth.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026. The library reports:
• Pruning easy examples (ranked by forgetting frequency or early correctness) beat power-law scaling, preserving accuracy while reducing data (~50% pruning), implying easy examples are largely redundant (2022).
• Training on overly-hard problems induces degenerate behaviors: models learn shortcuts (repeating answers, skipping reasoning) when problems exceed reach; benefit only exists in a difficulty band (2026).
• Teacher-refined data degrades student models when it exceeds the student's learning frontier; difficulty is relative to learner, not absolute (2024).
• Richer, over-smoothed teacher traces improve in-domain scores but strip uncertainty needed for out-of-distribution generalization (2024).
• Easy examples reward surface pattern-matching; models only generalize when given explicit theoretical frameworks, not labels alone (2024).

Anchor papers (verify; mind their dates):
• arXiv:2206.14486 (2022) — data pruning beating power-law scaling
• arXiv:2504.07912 (2025) — RL post-training amplifying learned behaviors
• arXiv:2605.28388 (2026) — mechanistic interpretation of sample difficulty in RLVR
• arXiv:2605.12484 (2026) — continual adaptation in LLMs

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, judge whether newer scaling laws, synthetic data, preference optimization (DPO/IPO), or modern evals (e.g., live leaderboards, reasoning benchmarks) have since relaxed or overturned it. Separate the durable insight (difficulty-adjusted curricula remain valuable) from the perishable limitation (e.g., power-law bounds may differ under different training regimes). Cite what relaxed it; flag what still holds.
(2) **Surface strongest contradicting work** from the last ~6 months. Does any recent paper argue that scale, architecture, or training procedure now makes easy examples *non-redundant*, or that overly-hard examples no longer induce shortcuts?
(3) **Propose 2 research questions** that assume the regime may have shifted: e.g., (a) Under continuous scaling + synthetic data, does the optimal difficulty band widen or narrow? (b) Do modern RL post-trainers (RLHF, test-time scaling) change the utility curve of example difficulty?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
