Can data pruning strategies exploit the finite nature of memorization capacity?
This explores whether knowing that models have a fixed memorization budget, a hard ceiling on what they can store, could guide which data (or which tokens) we keep versus discard during training.
The corpus doesn't have a paper that directly proposes "prune your dataset to fit the memorization ceiling," but it assembles every piece you'd need to reason about why that idea is plausible, and where it gets complicated.
The anchor is the finding that memorization capacity is finite and measurable. GPT-family models store roughly 3.6 bits per parameter, and once that budget fills, a phase transition kicks in: the model stops memorizing and starts generalizing, the so-called grokking shift (see "When do language models stop memorizing and start generalizing?"). The crucial detail is that this capacity is a property of the model, not the training recipe. That reframes pruning: if capacity is fixed, then every redundant or low-value example you feed the model competes for a scarce resource. Pruning isn't just about saving compute; it's about not wasting bits on data that doesn't earn its place.
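To make the budget concrete, here is a minimal arithmetic sketch. The 3.6 bits-per-parameter figure is the approximate value quoted above; the function names and the idea of expressing a dataset's information content as a single `dataset_bits` number are illustrative assumptions, not something the cited notes define.

```python
# Approximate figure quoted above for GPT-family models.
BITS_PER_PARAM = 3.6


def capacity_bits(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated total memorization budget of a model, in bits."""
    return n_params * bits_per_param


def capacity_megabytes(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """The same budget expressed in megabytes (8 bits per byte, 1e6 bytes per MB)."""
    return capacity_bits(n_params, bits_per_param) / 8 / 1e6


def fill_ratio(dataset_bits: float, n_params: int) -> float:
    """How many times over a dataset would fill the budget.

    `dataset_bits` is a hypothetical input: some estimate of the dataset's
    information content in bits. A ratio above 1.0 means the data exceeds
    what the model could memorize outright.
    """
    return dataset_bits / capacity_bits(n_params)
```

Under these assumptions, a 1-billion-parameter model has a budget of about 3.6 billion bits, or roughly 450 MB, which is small next to any modern training corpus.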
The corpus's most concrete version of "exploit the budget" operates at the token level rather than the example level. One line of work shows that models internally rank tokens by functional importance (symbolic computation tokens are preserved while grammar and meta-discourse are discarded first) and that students trained on these intelligently pruned reasoning chains actually outperform students trained on frontier-model compressions (see "Which tokens in reasoning chains actually matter most?"). That's pruning that respects what's worth memorizing. The mirror image is just as instructive: when memorization goes wrong in reasoning, it's dominated by *local* memorization from immediately preceding tokens, which accounts for up to two-thirds of errors (see "Where do memorization errors arise in chain-of-thought reasoning?"). So the same finite store that grokking eventually converts into generalization can, if filled with the wrong material, lock in shortcut errors, which is exactly the failure a capacity-aware pruning strategy would want to avoid.
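The greedy, score-preserving flavor of token pruning described above can be sketched roughly as follows. This is a toy illustration, not the cited method: `greedy_prune`, `toy_score`, and `TOY_WEIGHTS` are hypothetical names, and the hand-set weights stand in for what would really be a model's likelihood of the final answer given the pruned chain.

```python
from typing import Callable, List, Sequence


def greedy_prune(
    tokens: Sequence[str],
    score: Callable[[List[str]], float],
    keep: int,
) -> List[str]:
    """Remove tokens one at a time until only `keep` remain.

    At each step, drop the token whose removal leaves the highest score,
    i.e. the token the scorer misses least.
    """
    kept = list(tokens)
    while len(kept) > keep:
        drop = max(range(len(kept)), key=lambda i: score(kept[:i] + kept[i + 1:]))
        del kept[drop]
    return kept


# Toy stand-in for a likelihood score: symbolic tokens carry most of the
# weight, filler words carry very little. These numbers are made up.
TOY_WEIGHTS = {"12": 3.0, "*": 3.0, "7": 3.0, "=": 2.0, "84": 3.0}


def toy_score(chain: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(tok, 0.125) for tok in chain)


chain = ["so", "we", "compute", "12", "*", "7", "=", "84", "okay"]
pruned = greedy_prune(chain, toy_score, keep=5)
print(pruned)  # ['12', '*', '7', '=', '84']
```

With a real model as the scorer, each step costs one forward pass per remaining token, so a practical version would likely need batching or a cheaper importance estimate.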
There's an even sharper signal that less data can do more. In reinforcement learning with verifiable rewards, a *single* well-chosen training example lifted math accuracy from 36% to 73.6% and kept improving test performance for 1,400 steps after training accuracy maxed out (see "Can a single training example unlock mathematical reasoning?"). That's the strongest evidence in the corpus that capability isn't built by volume; it's *activated* by the right examples. If one example can unlock latent capability, then aggressive pruning isn't a compromise; it can be the point.
The lateral surprise comes from architectures that treat memorization as a budget to be allocated rather than maximized. Wide & Deep models split the work so the memorization half stays small: it only patches the rare cases the generalization half handles poorly, which means you deliberately spend memorization capacity on a curated minority of inputs (see "Can one model memorize and generalize better than two?"). Neural memory modules push this further by storing only *surprising* tokens, effectively pruning the predictable ones out of long-term memory in real time (see "Can neural memory modules scale language models beyond attention limits?"). Read together, these suggest the answer to your question is yes, in principle: the finite store is real, surprise and functional importance are computable proxies for what deserves a slot, and both data-side methods and architectures already exploit that. But no one in this collection has yet closed the loop by pruning a dataset explicitly against the measured bits-per-parameter ceiling. That gap is where the interesting work is.
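As a purely hypothetical sketch of what "closing the loop" might look like, the snippet below ranks examples by surprisal and keeps them until a bit budget is spent. Nothing in the corpus proposes this procedure. Surprisal (negative log2 probability under some reference model) is swapped in here as a simple proxy for surprise, the per-example probabilities are made-up inputs, and `select_under_budget` is an invented name.

```python
import math
from typing import List, Tuple


def surprisal_bits(prob: float) -> float:
    """Information content, in bits, of an event with probability `prob`."""
    return -math.log2(prob)


def select_under_budget(
    examples: List[Tuple[str, float]],
    budget_bits: float,
) -> List[str]:
    """Greedily keep the most surprising examples that still fit the budget.

    `examples` holds (example_id, cost_in_bits) pairs. Assumption: the cost
    of storing an example equals its surprisal under a reference model.
    """
    kept: List[str] = []
    spent = 0.0
    for example_id, bits in sorted(examples, key=lambda e: e[1], reverse=True):
        if spent + bits <= budget_bits:
            kept.append(example_id)
            spent += bits
    return kept


# Hypothetical reference-model probabilities for three examples.
candidates = [
    ("a", surprisal_bits(0.5)),    # 1 bit
    ("b", surprisal_bits(0.125)),  # 3 bits
    ("c", surprisal_bits(0.25)),   # 2 bits
]
selected = select_under_budget(candidates, budget_bits=4.0)
print(selected)  # ['b', 'a']
```

One caveat on the design: raw surprisal also rewards noise and mislabeled data, so a real strategy would presumably need to combine it with a quality or functional-importance signal like the token ranking discussed earlier.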
Sources (6 notes)
When do language models stop memorizing and start generalizing?
GPT-family models have a measurable memorization capacity of approximately 3.6 bits per parameter. When this capacity fills, a phase transition triggers grokking: the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Which tokens in reasoning chains actually matter most?
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Where do memorization errors arise in chain-of-thought reasoning?
The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Can a single training example unlock mathematical reasoning?
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Can one model memorize and generalize better than two?
Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.
Can neural memory modules scale language models beyond attention limits?
The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.