INQUIRING LINE

The 'best' training examples can actually hurt an AI — quality is relative to what the learner is ready to absorb.

Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?

This explores whether spreading training examples across a range of difficulty levels beats simply picking the 'best' or hardest examples — and the corpus reframes the question by showing that 'high quality' only means something relative to what the learner can absorb.

The collection's sharpest move is to challenge what 'high quality' even means. Several notes converge on the same surprising idea: quality is not an absolute property of an example but a relationship between the example and the learner. Teacher-refined data that is objectively better can actively *degrade* a student model when it sits beyond the student's learning frontier, so the fix is for students to filter refinements against their own statistical profile and keep only what is compatible (see 'Does teacher-refined data always improve student model performance?'). Push this to the extreme and you get the failure case for 'just pick the hardest, richest examples': training on near-impossible problems causes models to learn degenerate shortcuts, such as answer repetition and skipped computation, that then contaminate skills the model already had (see 'Do overly hard RLVR samples actually harm model capabilities?'). So selecting purely for difficulty or 'quality' can be worse than worthless.

Sources 7 notes

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
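The filtering idea in this note can be sketched in a few lines. The note does not say what the student's "statistical profile" is, so this sketch assumes it is the student's own mean per-token log-probability; `filter_refinements`, `toy_student_logprob`, the `max_drop` threshold, and the toy vocabulary are all hypothetical names and values chosen for illustration, not the method from the underlying work.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a student model: fixed per-token log-probabilities.
# (Assumption: a real setup would query the student model instead.)
TOY_LOGPROBS = {
    "add": -1.0, "the": -0.5, "numbers": -1.2, "sum": -1.1,
    "two": -0.9, "apply": -1.5, "commutative": -6.0, "monoid": -7.0,
}

def toy_student_logprob(text: str) -> float:
    """Mean per-token log-probability; unknown tokens get a low default."""
    tokens = text.lower().split()
    total = sum(TOY_LOGPROBS.get(tok, -8.0) for tok in tokens)
    return total / max(len(tokens), 1)

def filter_refinements(
    pairs: List[Tuple[str, str]],
    student_logprob: Callable[[str], float],
    max_drop: float = 0.5,
) -> List[str]:
    """For each (original, refined) pair, keep the teacher's refinement only
    if the student scores it no more than `max_drop` below the original;
    otherwise fall back to the original example."""
    kept = []
    for original, refined in pairs:
        if student_logprob(refined) >= student_logprob(original) - max_drop:
            kept.append(refined)
        else:
            kept.append(original)
    return kept

pairs = [
    ("add the numbers", "sum the two numbers"),          # close to the student's profile
    ("add the numbers", "apply the commutative monoid"), # far beyond it
]
selected = filter_refinements(pairs, toy_student_logprob)
```

With these toy scores, the first refinement is kept and the second is rejected in favour of the original, which is the behaviour the note describes: the student accepts only the improvements it can plausibly absorb.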

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
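To see why a rare accidental success becomes a high-advantage trajectory, here is a minimal sketch that assumes the common GRPO-style form of group-relative normalization (reward minus group mean, divided by group standard deviation). The note does not give the exact formula, so the function name, the `eps` term, and the group size of 16 are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Near-impossible prompt: 1 accidental success among 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Moderately hard prompt: 8 successes among 16 rollouts.
medium_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

print(f"lucky success on hard prompt:  {hard_adv[0]:.2f}")
print(f"typical success on medium one: {medium_adv[0]:.2f}")
```

Under these assumptions the lone success on the hard prompt receives an advantage of about 3.87 (the square root of 15), against 1.0 for a success on the medium prompt. Whatever produced that lucky answer, including a shortcut such as repeating a guess, is therefore reinforced almost four times as strongly.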

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can careful selection of 78 demos outperform massive training datasets?

LIMI achieves 73.5% on AgencyBench using only 78 curated multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7%. Complete interaction sequences capturing tool use and reasoning appear to activate latent agentic patterns already present in pretrained models.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example (1.72 match)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (1.71 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (1.67 match)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (1.65 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.63 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.61 match)
Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks (0.90 match)
Large Language Model Reasoning Failures (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Does selecting examples from multiple complexity levels outperform selecting only high-quality examples? A curated library (2023–2026) found the following, with the year of each finding in parentheses (these are dated claims, not current truth):

• Teacher-refined data objectively superior in isolation can degrade student models when beyond their learning frontier; students must filter refinements against their own statistical profile (2024).
• Training on near-impossible problems causes models to learn degenerate shortcuts (answer repetition, skipped computation) that contaminate already-mastered skills (2025).
• RL post-training amplifies pretraining behaviors, creating echo chambers that can lock in shortcuts rather than diverse reasoning (2025).
• CoT length shows brittle correlation with problem complexity; longer reasoning does not reliably indicate better problem-fitting (2025).
• Sample difficulty plays a mechanistic role in RLVR outcomes, but the relationship is non-monotonic — optimal difficulty depends on learner state (2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (Echo Chamber, 2025-04)
• arXiv:2509.07339 (Performative Thinking, 2025-09)
• arXiv:2605.28388 (Sample Difficulty in RLVR, 2026-05)
• arXiv:2501.09223 (Foundations, 2025-01)

Your task:
(1) RE-TEST THE LEARNER–EXAMPLE FIT CONSTRAINT. For each finding above, judge whether newer model scaling, curriculum scheduling, adaptive filtering methods, or multi-stage training (e.g., mixture-of-experts routing by difficulty, or dynamic threshold adjustment) have since relaxed or overturned it. Separate the durable claim (quality is relative to learner state) from perishable details (e.g., 'teacher refinement always fails'). Cite what changed it.
(2) Surface the strongest *disagreement* from the last 6 months: does any recent work claim that high-quality-only selection *does* outperform mixed-complexity, and on what grounds? Flag the tension explicitly.
(3) Propose two research questions that assume the regime may have moved: one on whether learned routing policies (model-to-model) can now overcome the fit problem; one on whether multi-task pretraining shifts the learner frontier such that 'hard' examples are no longer degenerate.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
