Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
This explores whether spreading training examples across a range of difficulty levels beats simply picking the 'best' or hardest examples — and the corpus reframes the question by showing that 'high quality' only means something relative to what the learner can absorb.
This explores whether spreading training examples across a range of complexity levels beats picking only the 'highest-quality' ones, and the collection's sharpest move is to challenge what 'high quality' even means. Several notes converge on the same surprising idea: quality is not an absolute property of an example, but a relationship between the example and the learner. Teacher-refined data that is objectively better can actively *degrade* a student model when it sits beyond the student's learning frontier, so the fix is for students to filter refinements against their own statistical profile and keep only what's compatible (see: Does teacher-refined data always improve student model performance?). Push this to the extreme and you get the failure case for 'just pick the hardest, richest examples': training on near-impossible problems causes models to learn degenerate shortcuts, such as answer repetition and skipped computation, that then contaminate skills the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). So selecting purely for difficulty or 'quality' can do active harm rather than merely fail to help.
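The student-side filtering idea can be sketched in a few lines. This is a minimal illustration and not the method from the cited note: the note does not say which statistic makes up the student's "statistical profile", so the sketch assumes it is the mean token log-probability the student assigns to a response, precomputed elsewhere. The `Candidate` type, the `select_compatible` function, and the `max_drop` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    original: str
    refined: str
    # Assumed statistic: mean token log-probability under the student model,
    # computed elsewhere. The source note does not specify the statistic.
    student_logp_original: float
    student_logp_refined: float


def select_compatible(candidates, max_drop=0.5):
    """Keep the teacher's refinement only when the student does not find it
    much less likely than the original; otherwise fall back to the original.

    max_drop is a hypothetical tolerance on the drop in mean log-probability.
    """
    chosen = []
    for c in candidates:
        drop = c.student_logp_original - c.student_logp_refined
        chosen.append(c.refined if drop <= max_drop else c.original)
    return chosen


candidates = [
    Candidate("short answer", "slightly clearer answer", -1.2, -1.4),
    Candidate("short answer", "dense expert-level derivation", -1.2, -3.0),
]
selected = select_compatible(candidates)
```

With these made-up scores, the first refinement is kept because the student finds it nearly as likely as the original, while the second is rejected in favor of the original because it sits far outside what the student currently models well.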
Sources (7 notes)
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
LRMs (large reasoning models) break not at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
LIMI achieves 73.5% on AgencyBench using only 78 curated multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7%. Complete interaction sequences capturing tool use and reasoning appear to activate latent agentic patterns already present in pretrained models.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
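The group-relative normalization effect mentioned in the note on nearly-impossible problems can be made concrete with a little arithmetic. The sketch below assumes a GRPO-style standardization (each rollout's reward minus the group mean, divided by the group standard deviation); the note itself does not give the formula, and the function name, the `eps` term, and the reward lists are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A near-impossible prompt: one accidental success among eight rollouts.
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]
# A prompt at the learner's frontier: half the rollouts succeed.
frontier_group = [1, 0, 1, 0, 1, 0, 1, 0]

hard_adv = group_relative_advantages(hard_group)
frontier_adv = group_relative_advantages(frontier_group)

lucky_success_advantage = hard_adv[-1]
typical_success_advantage = frontier_adv[0]
```

Under this assumed formula, the single lucky success in the hard group receives an advantage of about 2.65 (the square root of 7), while each success in the balanced group receives about 1.0. The rarer the success, the larger its weight, so whatever shortcut happened to produce it is reinforced more strongly than sound reasoning on a tractable problem.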