INQUIRING LINE

Can data pruning and equal contribution be reconciled in optimal learning?

This explores an apparent contradiction in the corpus: one line of work says optimal learning treats every training example as an equal contributor, while another says you can throw away half your data by ranking examples and keeping only the hard ones — can both be true at once?


The corpus stages this contradiction between two results. On one side, work deriving optimal training from a compression objective produces a "Learning Law" in which every example contributes equally in the ideal learning process (see "Does optimal language model learning maximize data compression?"). On the other, ranking examples by difficulty (how often the model forgets them, how hard they are to fit) lets you prune the easy, redundant half and beat the usual power-law scaling curve, sometimes turning it exponential (see "Can we prune training data without hurting model performance?"). Equal contribution versus aggressive pruning sounds like a flat contradiction. It mostly isn't.

The reconciliation hinges on what "equal" is measuring. The Learning Law describes the optimal *process* — given the right data already in hand, the math that minimizes compression loss ends up weighting each retained example evenly over the course of training. Pruning operates one level up, at *dataset construction*: it asks which examples deserve to be in your hands at all. An easy example a model nails on first exposure carries almost no new information; keeping it just dilutes the gradient signal. So pruning is about removing examples whose marginal contribution is near zero, while equal contribution is about how the survivors are treated once redundancy is gone. Read that way, pruning is what *creates* the conditions for equal contribution — it strips out the freeloaders so the remaining examples genuinely pull equal weight.
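The two stages described above can be sketched as separate steps. This is a minimal illustration, not the method of either cited work: it assumes an EL2N-style score (the L2 norm of the softmax output minus the one-hot label) as the difficulty proxy, and it uses plain uniform weights as a simplified stand-in for the Learning Law's equal contribution rather than deriving it. The function names, the toy logits, and the 0.5 keep fraction are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def el2n_score(logits, label):
    """EL2N-style difficulty: L2 norm of (softmax(logits) - one_hot(label))."""
    probs = softmax(logits)
    return math.sqrt(sum((p - (1.0 if i == label else 0.0)) ** 2
                         for i, p in enumerate(probs)))

def prune_easy(scores, keep_fraction=0.5):
    """Stage 1 (dataset construction): keep the indices of the hardest examples."""
    n_keep = max(1, int(round(len(scores) * keep_fraction)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

def uniform_weights(kept):
    """Stage 2 (training): every retained example gets the same weight."""
    return {i: 1.0 / len(kept) for i in kept}

# Hypothetical toy data: (logits, true label) for four examples.
examples = [
    ([8.0, 0.0, 0.0], 0),  # confidently right: easy
    ([0.2, 0.1, 0.0], 0),  # barely right: hard
    ([0.0, 6.0, 0.0], 1),  # confidently right: easy
    ([3.0, 0.0, 0.0], 2),  # confidently wrong: hardest
]
scores = [el2n_score(lg, y) for lg, y in examples]
kept = prune_easy(scores, keep_fraction=0.5)
weights = uniform_weights(kept)
print(kept, weights)
```

The point of keeping the two functions apart is that `prune_easy` decides membership and `uniform_weights` decides treatment; neither needs to know how the other works.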

The corpus has a cluster of adjacent work that treats data selection as the real lever rather than scale. Optimal experimental design reframes few-shot example choice as budgeted active learning, picking the demonstrations that most reduce uncertainty rather than the ones that merely look similar: similarity heuristics lose to informativeness (see "Can optimal experimental design improve few-shot example selection?"). And the sample-complexity analysis of latent prediction shows *why* not all signal is equal: predicting your own latents is exponentially more efficient than predicting tokens because same-level latents are far more correlated, so the learning target you choose changes how much each example actually teaches (see "Why is predicting latents more sample-efficient than tokens?"). Both reinforce the same theme: the value of a training signal is not uniform until you've engineered it to be.

There's a quieter tension worth flagging, though. Equal contribution is derived under a clean lossless-compression objective; difficulty-based pruning relies on noisy proxies (EL2N, forgetting scores, memorization) to *estimate* which examples are redundant. Those proxies can misfire, and the corpus shows learning dynamics are easily distorted by what you feed in. RL post-training, for instance, will collapse onto a single dominant data format within one epoch and suppress the alternatives, with the winner depending on model scale rather than necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). So a pruning rule that systematically discards a whole region of the distribution doesn't just remove redundancy; it can reshape what the model becomes. Equal contribution among survivors is only as fair as the selection that produced them.
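One way to make that risk concrete is to compare how much of each data region survives pruning. The sketch below is an assumed diagnostic, not something taken from the cited notes: the group labels, the scores, and the helper names `group_shares`, `coverage_shift`, and `vanished_groups` are hypothetical, and the top-k rule stands in for any difficulty-based pruning rule.

```python
from collections import Counter

def group_shares(groups):
    """Fraction of examples belonging to each group label."""
    counts = Counter(groups)
    total = len(groups)
    return {g: c / total for g, c in counts.items()}

def coverage_shift(groups, kept):
    """Change in each group's share after pruning (after minus before)."""
    before = group_shares(groups)
    after = group_shares([groups[i] for i in kept])
    return {g: after.get(g, 0.0) - before[g] for g in before}

def vanished_groups(groups, kept):
    """Groups present before pruning that have no surviving example."""
    survivors = {groups[i] for i in kept}
    return sorted(set(groups) - survivors)

# Hypothetical data: one region ("math") happens to score as uniformly easy.
groups = ["code", "code", "prose", "prose", "math", "math"]
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.15]

# Keep the three highest-scoring (hardest) examples.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
kept = sorted(ranked[:3])

shift = coverage_shift(groups, kept)
vanished = vanished_groups(groups, kept)
print(shift, vanished)
```

Here the score ranking removes every "math" example, so the survivors can be weighted perfectly evenly and still describe a different distribution than the one the dataset started with.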

The thing you didn't know you wanted to know: these two results aren't rivals, they're sequential. Pruning answers "which examples should exist," the Learning Law answers "how should the ones that exist be weighted," and a third strand, conditional scaling laws that fold architecture into the prediction, answers "given that data, how do you spend the rest of the budget" (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). Optimal learning isn't one objective; it's a pipeline where each stage assumes the previous one did its job.


Sources (6 notes)

Does optimal language model learning maximize data compression?

Research shows that optimal LM training can be derived from a lossless compression objective, yielding a Learning Law where all examples contribute equally in the optimal process. This approach improves scaling law coefficients, not just constants.

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.

Can optimal experimental design improve few-shot example selection?

AIPD frames demonstration selection as budgeted active learning, choosing examples that maximally reduce test-set uncertainty. Two algorithms (GO and SAL) outperformed similarity-based methods across small, medium, and large language models.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing claims about optimal learning in language models. The question: **Can data pruning and equal contribution be reconciled in optimal learning?**

What a curated library found — and when (dated claims, not current truth):

• The Learning Law (lossless-compression objective) predicts equal per-example weighting over training in the optimal regime (~2024).
• Difficulty-based pruning (EL2N, forgetting scores) removes ~50% of data and achieves exponential rather than power-law scaling (~2022).
• Pruning operates at dataset construction; equal contribution governs treatment of *retained* examples — these are sequential stages, not contradictions (~2024).
• Pruning proxies (noisy estimates of redundancy) can systematically distort data distributions; RL post-training collapses onto dominant pretraining formats within one epoch, reshaping model behavior (~2025).
• Active example selection via optimal experimental design (informativeness criterion) outperforms similarity-based heuristics in few-shot settings (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.17759 (Towards Optimal Learning of Language Models, 2024-02)
• arXiv:2206.14486 (Beyond neural scaling laws: beating power law scaling via data pruning, 2022-06)
• arXiv:2504.07912 (Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining, 2025-04)
• arXiv:2605.27734 (Learn from your own latents, 2026-05)

**Your task:**

(1) **RE-TEST EACH CONSTRAINT.** For each claim above, determine whether newer models, improved pruning heuristics (beyond EL2N), learned data selection methods, or multi-stage curricula have since *relaxed* the tension between pruning's selectivity and equal contribution's universality. Separate the durable insight (data selection is a distinct optimization stage) from perishable limitations (current proxies are noisy; RL reshaping is inevitable). What resolved each constraint, or does it still hold?

(2) **Surface strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months that either challenges the sequential-stages framing or shows equal contribution fails even *after* aggressive pruning.

(3) **Propose 2 research questions** that assume the regime may have shifted: (a) Can learned pruning masks (gradient-based, not heuristic) preserve distribution shape while removing redundancy? (b) Does equal contribution hold across *heterogeneous* modalities or only within-token sequences?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
