Can we prune training data without hurting model performance?
This note explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets appear to contain substantial redundancy: if we can find which examples are truly necessary, we could train comparable or better models on far less data.
"Beyond Neural Scaling Laws" (arXiv:2206.14486) challenges the assumption that power-law scaling is unavoidable. Power-law scaling of error with dataset size implies massive redundancy: many training examples contribute only marginally. If you can rank examples by difficulty or importance and prune the easy, redundant ones, you can beat the power law.
In theory, the paper shows that exponential scaling is possible with an ideal pruning metric. In practice, it reports better-than-power-law scaling on ResNets trained on CIFAR-10, SVHN, and ImageNet.
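To make the gap between the two regimes concrete, here is a minimal sketch that compares how much data each one needs. The functional forms (error = scale * P**(-nu) for the power law, error = scale * exp(-rate * P) for the exponential case) and all parameter names are illustrative assumptions, not the paper's fitted curves.

```python
import math


def examples_for_power_law(target_error, nu, scale=1.0):
    """Examples needed if error = scale * P**(-nu) (assumed form)."""
    return (scale / target_error) ** (1.0 / nu)


def examples_for_exponential(target_error, rate, scale=1.0):
    """Examples needed if error = scale * exp(-rate * P) (assumed form)."""
    return math.log(scale / target_error) / rate


def growth_to_halve_error_power_law(nu):
    """Factor by which the dataset must be multiplied to halve the error."""
    return 2.0 ** (1.0 / nu)


def growth_to_halve_error_exponential(rate):
    """Number of extra examples that must be added to halve the error."""
    return math.log(2.0) / rate
```

Under the power law, each halving of the error multiplies the required data by a constant factor, and that factor grows quickly as the exponent nu shrinks. Under exponential decay, each halving only adds a constant number of examples.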
The pruning metrics reveal a taxonomy of training example difficulty:
- EL2N scores: The L2 norm of the error vector (predicted class probabilities minus the one-hot label), averaged over a small ensemble of networks trained only briefly. About 50% of CIFAR-10 is prunable without accuracy loss.
- Forgetting scores: How many times an example flips from correctly classified to misclassified during training. Never-forgotten examples are largely redundant.
- Memorization scores: How much the presence of an example in the training set increases the probability the model assigns its correct label. High memorization means the example must be individually learned (it is not derivable from other data).
- Influence scores: How much including an example in training changes performance on the test set.
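The first three metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation from any of the cited work: the array layouts, function names, and the simplification of checking correctness once per epoch are all assumptions, and the memorization estimator assumes every example is both included in and held out of at least one training run.

```python
import numpy as np


def el2n_scores(probs, labels):
    """EL2N: mean over models of ||softmax output - one-hot label||_2.

    probs:  float array, shape (n_models, n_examples, n_classes)
    labels: int array, shape (n_examples,)
    """
    one_hot = np.eye(probs.shape[-1])[labels]      # (n_examples, n_classes)
    errors = probs - one_hot[None, :, :]           # broadcast over models
    return np.linalg.norm(errors, axis=-1).mean(axis=0)


def forgetting_counts(correct_history):
    """Count correct -> incorrect transitions per example.

    correct_history: bool array, shape (n_epochs, n_examples)
    """
    h = correct_history.astype(bool)
    forgot = h[:-1] & ~h[1:]                       # correct, then wrong next epoch
    return forgot.sum(axis=0)


def memorization_scores(correct, included):
    """Accuracy on an example when it was trained on, minus when it was held out.

    correct:  bool array (n_models, n_examples), model predicted the true label
    included: bool array (n_models, n_examples), example was in that model's train set
    """
    c = correct.astype(float)
    inc = included.astype(bool)
    acc_in = (c * inc).sum(axis=0) / inc.sum(axis=0)
    acc_out = (c * ~inc).sum(axis=0) / (~inc).sum(axis=0)
    return acc_in - acc_out
```

In all three cases a higher value marks a harder example, so the scores can be fed into the same ranking step.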
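The key insight: easy examples (low forgetting, low memorization, low EL2N) are largely redundant with the rest of the data, while hard examples carry information the rest of the data does not supply. Pruning easy examples therefore preserves most of the information that matters. The paper adds a caveat: this holds when data is abundant; when the initial dataset is small, keeping the easy examples works better.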
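Once every example has a difficulty score, the pruning step itself is a ranking. A minimal sketch, assuming higher scores mean harder examples; the function name and the `keep_fraction` parameter are hypothetical:

```python
import numpy as np


def keep_hardest(scores, keep_fraction):
    """Return sorted indices of the hardest `keep_fraction` of examples."""
    scores = np.asarray(scores)
    n_keep = int(round(keep_fraction * len(scores)))
    hardest_first = np.argsort(scores)[::-1]       # descending difficulty
    return np.sort(hardest_first[:n_keep])
```

The returned indices can be used to subset a dataset before training, for example `train_x[keep_hardest(scores, 0.5)]` to drop the easier half.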
Read alongside "Can we train better models on less data?", the data pruning finding extends from instruction tuning to pretraining. The principle is the same (data efficiency comes from identifying the valuable subset), but the mechanisms differ: LESS, the instruction-tuning selection approach in that note, uses gradient-based influence, whereas data pruning uses difficulty metrics. Both converge on the same conclusion: most training data is redundant, and identifying the valuable fraction is the key optimization.
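For contrast with the difficulty metrics, gradient-based influence scores a training example by how well its gradient aligns with the gradient of a target objective. The sketch below is a generic first-order illustration using cosine similarity, not the LESS implementation; it assumes per-example gradients have already been computed and flattened into rows, and that none of them is the zero vector.

```python
import numpy as np


def gradient_influence(train_grads, target_grad):
    """Cosine similarity between each training gradient and a target gradient.

    train_grads: float array, shape (n_examples, n_params)
    target_grad: float array, shape (n_params,)
    """
    train_grads = np.asarray(train_grads, dtype=float)
    target_grad = np.asarray(target_grad, dtype=float)
    target_unit = target_grad / np.linalg.norm(target_grad)
    train_norms = np.linalg.norm(train_grads, axis=1)
    return (train_grads @ target_unit) / train_norms
```

A score near 1 means a gradient step on that example moves the parameters in the direction the target objective wants, while a negative score means it pushes against it. Unlike a difficulty score, this depends on the chosen target.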
A practical challenge remains: most high-performing metrics are computationally expensive and require labels. The paper develops a self-supervised pruning metric that scales to ImageNet with comparable performance — making data pruning viable for large unlabeled corpora.
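The note does not spell out the self-supervised metric. The sketch below assumes it works as a prototype distance: cluster the embeddings from a self-supervised encoder with k-means, then score each example by its cosine distance to the nearest centroid. The plain NumPy k-means, the function name, and every parameter are illustrative assumptions; embeddings are assumed to be nonzero vectors.

```python
import numpy as np


def prototype_distance_scores(embeddings, n_clusters, n_iters=20, seed=0):
    """Label-free difficulty: cosine distance to the nearest k-means prototype."""
    rng = np.random.default_rng(seed)
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)    # unit-normalise rows

    # Initialise prototypes from randomly chosen data points.
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)]

    for _ in range(n_iters):
        assign = (x @ centroids.T).argmax(axis=1)       # nearest by cosine similarity
        for k in range(n_clusters):
            members = x[assign == k]
            if len(members) == 0:
                continue                                # keep the old prototype
            mean = members.mean(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centroids[k] = mean / norm

    # Far from every prototype = atypical = treated as hard.
    return 1.0 - (x @ centroids.T).max(axis=1)
```

No labels are used anywhere, so the scores can be computed for an unlabeled corpus and passed to the same ranking step as the supervised metrics.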
Inquiring lines that use this note as a source 11
This note is a source for these synthesized inquiries.
- What is craft-residue and why does its loss matter?
- Why do easy training examples contribute less to model generalization than hard ones?
- Can gradient-based influence scores beat difficulty metrics for identifying valuable training data?
- What makes a self-supervised pruning metric work without labels at scale?
- Can selecting the right data subset outperform training on everything?
- What makes utility-weighted training backfire in machine learning systems?
- What training data contamination rates threaten model safety most practically?
- Does trace length actually reflect problem difficulty or training proximity?
- How do difficulty metrics relate to the true value of training examples?
- What mechanisms cause overly hard samples to degrade prior model performance?
- Can data pruning and equal contribution be reconciled in optimal learning?
Related concepts in this collection 7
- Can we train better models on less data?
Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
same principle for instruction tuning: identify the valuable subset
- Can training data augmentation match test-time compute scaling benefits?
Can generating thinking trajectories during pretraining unlock the same efficiency gains that test-time scaling provides at inference? This explores whether the compute-allocation principle works across the training-inference boundary.
complementary approach: augment rather than prune
- When do language models stop memorizing and start generalizing?
Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
if memorization has finite 3.6 bits-per-parameter capacity, pruning easy (redundant) examples frees capacity for generalization to begin sooner
- Can we predict keyword priming before learning happens?
Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
adversarial counterpart to data pruning: while pruning removes redundant data to improve efficiency, priming shows that even 3 exposures of novel data disproportionately reshape model behavior; both demonstrate unequal training example impact, from opposite directions
- Can models improve themselves on tasks without verifiable answers?
Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
extreme case of data value concentration: catalyst data may represent the irreducibly necessary examples (high difficulty, high memorization score) that data pruning would preserve; both converge on the principle that a small fraction of maximally informative examples carries disproportionate training signal
- What can a bounded observer actually learn from data?
Classical information measures treat all high-entropy content equally, but computationally bounded learners can only extract certain types of structure. What distinguishes learnable regularity from random noise that bounded agents face?
grounds: difficulty metrics empirically approximate the learnable-value quantity epiplexity proposes to measure directly
- Why do medium-difficulty problems teach reasoning better than hard ones?
Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
extends: difficulty-based selection applied to RLVR rollouts, with a non-monotonic optimal band
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond neural scaling laws: beating power law scaling via data pruning
- Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
- Embarrassingly Shallow Autoencoders for Sparse Data
- Adam's Law: Textual Frequency Law on Large Language Models
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
Original note title
data pruning based on difficulty metrics can achieve exponential rather than power-law scaling — not all training examples are equally valuable