Can data pruning and equal contribution be reconciled in optimal learning?
This explores an apparent contradiction in the corpus: one line of work says optimal learning treats every training example as an equal contributor, while another says you can throw away half your data by ranking examples and keeping only the hard ones — can both be true at once?
Two results in the corpus appear to contradict each other. On one side, work deriving optimal training from a compression objective produces a "Learning Law" in which every example contributes equally in the ideal learning process (see "Does optimal language model learning maximize data compression?"). On the other, ranking examples by difficulty (how often the model forgets them, how hard they are to fit) lets you prune the easy, redundant half and beat the usual power-law scaling curve, sometimes turning it exponential (see "Can we prune training data without hurting model performance?"). Equal contribution versus aggressive pruning sounds like a flat contradiction. It mostly isn't.
The reconciliation hinges on what "equal" is measuring. The Learning Law describes the optimal *process* — given the right data already in hand, the math that minimizes compression loss ends up weighting each retained example evenly over the course of training. Pruning operates one level up, at *dataset construction*: it asks which examples deserve to be in your hands at all. An easy example a model nails on first exposure carries almost no new information; keeping it just dilutes the gradient signal. So pruning is about removing examples whose marginal contribution is near zero, while equal contribution is about how the survivors are treated once redundancy is gone. Read that way, pruning is what *creates* the conditions for equal contribution — it strips out the freeloaders so the remaining examples genuinely pull equal weight.
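The two-stage reading above can be sketched in a few lines of Python. This is a minimal illustration, not the method of either line of work: the `(example_id, logits, label)` tuple format, the `prune_then_weight` helper, and the toy data are all assumptions made for the example, and the EL2N-style score here uses a single set of logits, whereas EL2N is usually averaged over several models early in training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def el2n_score(logits, label):
    """EL2N-style difficulty: L2 norm of (predicted probabilities - one-hot label).

    Single-model simplification (an assumption of this sketch).
    """
    probs = softmax(logits)
    return math.sqrt(sum((p - (1.0 if i == label else 0.0)) ** 2
                         for i, p in enumerate(probs)))

def prune_then_weight(examples, keep_fraction=0.5):
    """Stage 1: keep the hardest examples. Stage 2: weight the survivors uniformly.

    `examples` is a list of (example_id, logits, label) tuples (hypothetical format).
    Returns a dict mapping each kept example_id to its training weight.
    """
    ranked = sorted(examples, key=lambda ex: el2n_score(ex[1], ex[2]), reverse=True)
    n_keep = max(1, int(keep_fraction * len(ranked)))
    kept = ranked[:n_keep]
    return {example_id: 1.0 / n_keep for example_id, _, _ in kept}

# Toy data: two examples the model already fits confidently, two it does not.
toy = [
    ("easy_a", [6.0, 0.0, 0.0], 0),
    ("easy_b", [0.0, 5.0, 0.0], 1),
    ("hard_a", [0.2, 0.1, 0.0], 2),
    ("hard_b", [2.0, 0.0, 0.0], 1),
]
weights = prune_then_weight(toy, keep_fraction=0.5)
print(weights)
```

Selection is the only place where examples are treated unequally; once the easy pair is dropped, each survivor receives the same weight.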
The corpus has a cluster of adjacent work that treats data selection, rather than scale, as the real lever. Optimal experimental design reframes few-shot example choice as budgeted active learning, picking the demonstrations that most reduce uncertainty rather than the ones that merely look similar; informativeness beats similarity-based heuristics (see "Can optimal experimental design improve few-shot example selection?"). And the sample-complexity analysis of latent prediction shows *why* not all signal is equal: predicting your own latents needs a number of samples that stays constant in hierarchy depth, while predicting tokens needs exponentially many, because same-level latents are far more correlated than raw tokens. The learning target you choose therefore changes how much each example actually teaches (see "Why is predicting latents more sample-efficient than tokens?"). Both reinforce the same theme: the value of a training signal is not uniform until you've engineered it to be.
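To make "most reduce uncertainty" concrete, here is a rough sketch of greedy selection under a budget. It is not the GO or SAL algorithm from the cited work. It swaps in a textbook stand-in: a Bayesian linear model with an identity prior, where each chosen example shrinks the posterior covariance through a Sherman-Morrison update and the score is the summed predictive variance on the test points. The feature vectors, names, and the `greedy_design` helper are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def posterior_after(cov, x, noise=1.0):
    """Sherman-Morrison update of a symmetric posterior covariance after observing x."""
    cx = matvec(cov, x)
    denom = noise + dot(x, cx)
    size = len(x)
    return [[cov[i][j] - cx[i] * cx[j] / denom for j in range(size)]
            for i in range(size)]

def predictive_uncertainty(cov, test_points):
    """Sum of predictive variances t^T cov t over the test points."""
    return sum(dot(t, matvec(cov, t)) for t in test_points)

def greedy_design(candidates, test_points, budget):
    """Pick up to `budget` candidates, each time taking the one that leaves
    the least predictive uncertainty on the test points."""
    dim = len(test_points[0])
    cov = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    remaining = dict(candidates)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        best = min(
            remaining,
            key=lambda name: predictive_uncertainty(
                posterior_after(cov, remaining[name]), test_points),
        )
        cov = posterior_after(cov, remaining.pop(best))
        chosen.append(best)
    return chosen

# Hypothetical feature vectors: two near-duplicates and one covering a new direction.
candidates = {
    "near_dup_1": [1.0, 0.0],
    "near_dup_2": [0.99, 0.01],
    "other_axis": [0.0, 0.9],
}
test_points = [[1.0, 0.0], [0.0, 1.0]]
chosen = greedy_design(candidates, test_points, budget=2)
print(chosen)
```

In this toy setup the second near-duplicate is passed over: after the first pick it adds little new information, even though it closely resembles an example already chosen.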
There's a quieter tension worth flagging, though. Equal contribution is derived under a clean lossless-compression objective; difficulty-based pruning relies on noisy proxies (EL2N, forgetting scores, memorization) to *estimate* which examples are redundant. Those proxies can misfire, and the corpus shows learning dynamics are easily distorted by what you feed in. RL post-training, for instance, amplifies a single dominant data format within the first epoch and collapses the alternatives, with the winner depending on model scale and not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). So a pruning rule that systematically discards a whole region of the distribution doesn't just remove redundancy; it can reshape what the model becomes. Equal contribution among survivors is only as fair as the selection that produced them.
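One simple guard against that failure mode is to audit what a pruning rule removes, group by group, before training on the result. The sketch below is an illustration only: the group labels (for example data format or topic), the `floor` threshold, and both helper names are assumptions, not anything taken from the cited work.

```python
from collections import Counter

def retention_by_group(group_of, kept_ids):
    """Fraction of each group that survives pruning.

    `group_of` maps example_id -> group label (hypothetical metadata).
    `kept_ids` is the collection of example ids the pruning rule retained.
    """
    totals = Counter(group_of.values())
    kept = Counter(group_of[example_id] for example_id in kept_ids)
    return {group: kept[group] / totals[group] for group in totals}

def collapsed_groups(retention, floor=0.25):
    """Groups whose retention rate fell below `floor` (an arbitrary threshold)."""
    return sorted(group for group, rate in retention.items() if rate < floor)

# Toy dataset: three formats with four examples each.
group_of = {
    "d1": "dialogue", "d2": "dialogue", "d3": "dialogue", "d4": "dialogue",
    "c1": "code", "c2": "code", "c3": "code", "c4": "code",
    "t1": "tables", "t2": "tables", "t3": "tables", "t4": "tables",
}
# A 50% prune that happens to remove every example of one format.
kept_ids = ["d1", "d2", "d3", "c1", "c2", "c3"]

retention = retention_by_group(group_of, kept_ids)
flagged = collapsed_groups(retention)
print(retention, flagged)
```

The overall keep rate is 50%, which looks unremarkable, yet one format has vanished entirely. A per-group check surfaces that a difficulty proxy has cut out a region of the distribution and not merely thinned out redundancy.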
The thing you didn't know you wanted to know: these two results aren't rivals, they're sequential. Pruning answers "which examples should exist," the Learning Law answers "how should the ones that exist be weighted," and a third strand, conditional scaling laws that fold architecture choices into the prediction, answers "given that data, how do you spend the rest of the budget" (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). Optimal learning isn't one objective; it's a pipeline where each stage assumes the previous one did its job.
Sources (6 notes)
Research shows that optimal LM training can be derived from a lossless compression objective, yielding a Learning Law where all examples contribute equally in the optimal process. This approach improves scaling law coefficients, not just constants.
Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.
AIPD frames demonstration selection as budgeted active learning, choosing examples that maximally reduce test-set uncertainty. Two algorithms (GO and SAL) outperformed similarity-based methods across small, medium, and large language models.
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.