Can selecting the right data subset outperform training on everything?
The question here is whether curating a smaller, well-chosen training set can beat the brute-force approach of training on all available data. The corpus answers with a fairly emphatic yes, and it also explains *why* extra data can actively hurt.
The collection's strongest finding is that a well-chosen subset routinely outperforms the full dataset, sometimes dramatically. LESS uses gradient similarity to pick the 5% of instruction examples most aligned with a target capability, and training on that sliver consistently beats training on the full set (see "Can we train better models on less data?"). LIMA pushes the same idea to the extreme: fine-tuning a strong base model on 1,000 carefully curated alignment examples is competitive with models trained on orders of magnitude more data, because post-training mostly *activates* capabilities the model already has rather than teaching new ones (see "Can careful curation replace massive alignment datasets?"). And in vision, ranking examples by difficulty and pruning the redundant easy ones beats the usual power-law scaling: on CIFAR-10, 50% of the data can be thrown away with no accuracy loss (see "Can we prune training data without hurting model performance?").
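As a rough illustration of gradient-similarity selection, the sketch below scores each training example by the cosine similarity between its gradient feature vector and a target-task feature vector, then keeps the top fraction. It is a minimal sketch, not the LESS implementation: the function names, the plain-list feature vectors, and the single target vector are assumptions made for illustration, and the low-rank gradient features are taken as given rather than computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_top_fraction(train_feats, target_feat, fraction=0.05):
    """Return indices of the training examples most aligned with the target.

    train_feats: one (hypothetical) gradient feature vector per training example.
    target_feat: a gradient feature vector representing the target capability.
    fraction:    share of the training set to keep (0.05 mirrors the 5% figure).
    """
    scores = [cosine(g, target_feat) for g in train_feats]
    k = max(1, round(fraction * len(train_feats)))
    # Rank example indices from most to least aligned with the target.
    ranked = sorted(range(len(train_feats)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Under this scoring, examples whose gradients point away from the target direction get negative similarity and fall to the bottom of the ranking, which is one plausible way to read the claim that some examples hinder a specific skill.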
The more interesting question is *why* less can beat more, and here the corpus reframes 'extra data' as actively harmful rather than neutral. LESS's own explanation is that mixed datasets contain examples that hinder a specific skill by nudging the model's reasoning strategy away from what the task needs (see "Can we train better models on less data?"). In RL with verifiable rewards (RLVR), overly hard samples are worse than useless: on near-impossible problems, models learn degenerate shortcuts such as answer repetition and computation-skipping, and those shortcuts then contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So 'train on everything' silently smuggles in examples that drag specific skills down.
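The source notes attribute part of this harm to group-relative normalization, which turns a rare accidental success into a high-advantage trajectory. A small worked example makes the arithmetic concrete. The sketch assumes the common form of group-relative advantage, (reward minus group mean) divided by group standard deviation, with binary rewards; the function name and the `eps` term are illustrative assumptions rather than details from the sources.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Near-impossible problem: 1 lucky success among 8 rollouts.
hard = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
# Moderate problem: 4 successes among 8 rollouts.
moderate = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

print(round(hard[0], 2))      # advantage of the lone success, about 2.65
print(round(moderate[0], 2))  # advantage of a success at a 50% solve rate, about 1.0
```

The lone success on the near-impossible problem receives roughly 2.6 times the advantage of a success on the moderate problem, so whatever produced that lucky answer, shortcut or not, is reinforced strongly.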
A subtler thread is that the *right* subset is relative to the learner, not absolute. Teacher-refined data that is objectively higher quality still degrades a student model when it sits beyond the student's learning frontier, so the student should filter refinements against its own statistical profile and keep only what's compatible (see "Does teacher-refined data always improve student model performance?"). That means there's no universal 'best subset': selection has to be conditioned on the model doing the learning, much as LESS conditions its selection on the target capability through target-aware gradients.
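One way to picture student-side filtering is sketched below. The notes do not say which statistic makes up the student's "statistical profile", so this sketch assumes a stand-in: the student's mean per-token negative log-likelihood (NLL) on each response. A refinement is kept only if it is not much harder for the student than the original. The field names, the `max_gap` threshold, and the NLL criterion itself are all hypothetical.

```python
def filter_refinements(pairs, max_gap=0.5):
    """Choose, per example, between the original and the teacher-refined response.

    pairs: list of dicts with keys
        'original', 'refined'         -> the two candidate responses
        'nll_original', 'nll_refined' -> the student's mean per-token NLL on each
                                         (assumed to be computed elsewhere)
    max_gap: largest NLL increase the student tolerates (hypothetical threshold).
    """
    chosen = []
    for pair in pairs:
        gap = pair["nll_refined"] - pair["nll_original"]
        # Keep the refinement only if it stays within the student's reach.
        chosen.append(pair["refined"] if gap <= max_gap else pair["original"])
    return chosen

pairs = [
    {"original": "a0", "refined": "a1", "nll_original": 1.2, "nll_refined": 1.4},
    {"original": "b0", "refined": "b1", "nll_original": 1.0, "nll_refined": 2.9},
]
selected = filter_refinements(pairs)  # keeps "a1", falls back to "b0"
```

The same teacher output can be accepted for one student and rejected for another, because the scores come from the student rather than from any absolute quality measure.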
Selection also doesn't have to be a separate preprocessing step; it can be folded into training itself. DRO reuses a single cross-rollout variance statistic both to weight tokens and to filter out degenerate queries on the fly, getting 2–3× faster training by discarding bad comparisons mid-stream (see "Can one statistical measure serve dual purposes in RL training?"). So 'curate then train' and 'filter while training' are two faces of the same insight.
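The dual use of one statistic can be sketched as follows. This is a minimal illustration under stated assumptions, not DRO's actual formulation: it assumes the statistic is the variance, across rollouts, of the log-probability assigned to each reference token, and the `logprobs[r][t]` layout, the sum-to-one weight normalization, and the `min_mean_variance` threshold are all hypothetical.

```python
from statistics import pvariance

def token_variances(logprobs):
    """Per-token variance across rollouts.

    logprobs[r][t]: log-probability of reference token t under rollout r.
    """
    n_tokens = len(logprobs[0])
    return [pvariance([row[t] for row in logprobs]) for t in range(n_tokens)]

def token_weights(variances):
    """Token-level use: weight each token by its share of the total variance."""
    total = sum(variances)
    if total == 0:
        # No token separates the rollouts, so fall back to uniform weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]

def keep_query(variances, min_mean_variance=1e-3):
    """Query-level use: drop queries whose rollouts are indistinguishable."""
    return sum(variances) / len(variances) >= min_mean_variance

informative = token_variances([[-0.1, -2.0], [-0.1, -0.5], [-0.1, -1.0]])
degenerate = token_variances([[-0.3, -0.3], [-0.3, -0.3]])
```

Here the first token of the informative query gets zero weight because every rollout assigns it the same log-probability, while the degenerate query is filtered out entirely because none of its tokens distinguish one rollout from another.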
The thing you might not have expected to learn: the case for subset selection isn't really about saving compute. It's that the full dataset is a mixture of helpful, useless, and outright harmful examples, and the harm doesn't average out — it transfers into the model as forgotten skills and learned shortcuts. Curation wins not because small is efficient, but because big quietly includes its own poison.
Sources (6 notes)
"Can we train better models on less data?" LESS uses low-rank gradient features to select the instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full-dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
"Can careful curation replace massive alignment datasets?" LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
"Can we prune training data without hurting model performance?" Ranking training examples by difficulty (EL2N, forgetting, and memorization scores) and removing the easy ones beats power-law scaling. On CIFAR-10, 50% of the data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.
"Do overly hard RLVR samples actually harm model capabilities?" Training on nearly impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
"Does teacher-refined data always improve student model performance?" Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if it is objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
"Can one statistical measure serve dual purposes in RL training?" DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
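The difficulty ranking mentioned in the pruning note above can be illustrated with EL2N, which scores an example by the L2 norm of the gap between the model's predicted class probabilities and the one-hot label. The sketch below is a simplified assumption-laden version: it takes probabilities from a single model snapshot as given, and the function names and the `keep_fraction` parameter are hypothetical.

```python
import math

def el2n(probs, label):
    """L2 norm of (predicted probabilities - one-hot label); higher means harder."""
    return math.sqrt(sum(
        (p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs)
    ))

def prune_easiest(examples, keep_fraction=0.5):
    """Keep the hardest examples and drop the easy, redundant ones.

    examples: list of (probs, label) pairs, one per training example.
    Returns the kept indices in their original order.
    """
    scores = [el2n(probs, label) for probs, label in examples]
    k = max(1, round(keep_fraction * len(examples)))
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

examples = [
    ([0.90, 0.10], 0),  # confidently correct: easy
    ([0.50, 0.50], 0),  # uncertain: harder
    ([0.20, 0.80], 0),  # confidently wrong: hardest
    ([0.99, 0.01], 0),  # almost perfectly fit: easiest
]
kept = prune_easiest(examples)  # indices 1 and 2
```

With `keep_fraction=0.5` the two confidently correct examples are discarded, mirroring the idea of pruning half the data by removing what the model already handles.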