How does in-weight memorization scale with model parameter count?
This explores whether the number of facts a model can store directly in its weights grows with parameter count, and what the corpus says about the ceiling, the measured rate, and the workarounds.
In-weight memorization means facts a model stores inside its parameters rather than looking up externally; the question is whether that capacity simply scales with size. The corpus gives a surprisingly precise answer: it scales linearly, at roughly a fixed exchange rate. GPT-family models hold about 3.6 bits of memorized information per parameter, and that number behaves like a physical property of the model rather than a quirk of how it was trained (When do language models stop memorizing and start generalizing?). So 'more parameters means more memory' is true, but only in the literal, bounded sense of bits per parameter, not in any open-ended way.
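To make the exchange rate concrete, here is a minimal arithmetic sketch. It assumes the roughly 3.6 bits-per-parameter figure quoted above holds as a simple linear rule; the function names and the three model sizes are illustrative choices, not values taken from the corpus.

```python
# Approximate figure quoted above for GPT-family models.
BITS_PER_PARAM = 3.6


def memorization_capacity_bits(num_params: int,
                               bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated in-weight memorization budget in bits, assuming linear scaling."""
    return num_params * bits_per_param


def bits_to_megabytes(bits: float) -> float:
    """Convert bits to decimal megabytes (1 MB = 8,000,000 bits)."""
    return bits / 8 / 1_000_000


# Hypothetical parameter counts, chosen only to show the linear growth.
for n in (125_000_000, 1_300_000_000, 7_000_000_000):
    cap = memorization_capacity_bits(n)
    print(f"{n:>13,d} params -> {cap:.3e} bits (~{bits_to_megabytes(cap):,.0f} MB)")
```

Under this rule, doubling the parameter count doubles the budget and nothing more, which is the "literal, bounded sense" in which more parameters mean more memory.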
The more interesting finding is what happens when that capacity fills up. Once a model has used its memorization budget, it doesn't just stop: it undergoes a phase transition into *grokking*, shifting from rote storage toward genuine generalization (When do language models stop memorizing and start generalizing?). In other words, the parameter count sets a memory ceiling, and pressing against that ceiling is what pushes a model to start abstracting instead of memorizing. Memorization and generalization aren't separate model types; they're two regimes the same network moves between as capacity saturates.
Because the ceiling is real, a separate line of work argues the smarter move is to stop scaling parameters for facts at all. A formal proof shows in-weight factual recall is fundamentally bounded by model size, while *tool use* (letting the model call out to an external lookup) decouples recall from parameter count entirely, giving effectively unbounded facts through a simple circuit (Can models store unlimited facts without growing larger?). The same work flags the hidden cost of cramming facts in via fine-tuning: it overwrites prior knowledge and degrades general ability. That reframes the scaling question: the bottleneck isn't 'how big,' it's 'why store it in weights when you don't have to.'
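The contrast between bounded in-weight recall and tool use can be pictured with a toy analogy. This is not the construction used in the proof: `InWeightStore`, `ToolLookup`, the 64-bits-per-fact cost, and the evict-the-oldest rule are all hypothetical simplifications introduced here for illustration.

```python
from collections import OrderedDict


class InWeightStore:
    """Toy stand-in for facts held in parameters: a fixed bit budget."""

    def __init__(self, num_params, bits_per_param=3.6, bits_per_fact=64):
        # Assumption: every fact costs the same number of bits.
        self.max_facts = int(num_params * bits_per_param // bits_per_fact)
        self.facts = OrderedDict()

    def write(self, key, value):
        # Assumption: writing past the budget evicts the oldest fact,
        # a crude analogy for fine-tuning overwriting prior knowledge.
        if key not in self.facts and len(self.facts) >= self.max_facts:
            self.facts.popitem(last=False)
        self.facts[key] = value

    def recall(self, key):
        return self.facts.get(key)


class ToolLookup:
    """Toy stand-in for tool use: recall is delegated to an external table."""

    def __init__(self, external_db):
        self.external_db = external_db

    def recall(self, key):
        # The lookup logic stays the same size however large the table grows.
        return self.external_db.get(key)


store = InWeightStore(num_params=100)      # budget of 5 facts in this toy
for i in range(8):
    store.write(f"fact{i}", i)

tool = ToolLookup({f"fact{i}": i for i in range(1000)})
print(store.recall("fact0"), store.recall("fact7"))
print(tool.recall("fact0"), tool.recall("fact999"))
```

In the toy, the in-weight store loses its earliest facts once the budget is spent, while the lookup path answers every query without growing.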
The corpus also tells you *where* in the network this memory lives, which matters for anyone who wants to edit or remove it. Memorized passages leave a localized fingerprint (larger gradients in lower layers and a specific attention head fixating on rare tokens), making memorization targetable rather than smeared across the whole model (Where does a model store memorized paragraphs?). So capacity isn't uniformly distributed; it concentrates, which is why a 3.6-bits-per-parameter average can coexist with very specific, surgically removable memories.
Two adjacent framings round this out. Recommender architectures faced this exact tension early: Wide & Deep models split memorization (a sparse cross-product tower) from generalization (dense embeddings) and train them jointly, so the memorizing half stays small because the generalizing half handles the common cases (Can one model handle both memorization and generalization?). And rather than buying capacity with parameters, the *Sleep* paradigm consolidates in-context knowledge into weights through offline distillation and rehearsal, adding memory without adding size or forgetting (Can models consolidate memories during offline sleep phases?). The throughline across all of these: parameter count buys you a fixed, measurable memory budget, and most of the recent ideas are about spending it more wisely rather than just buying more.
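The Wide & Deep split mentioned above can be sketched as a forward pass in which a sparse memorizing term and a dense generalizing term are summed into one logit. This is a minimal pure-Python sketch, not a production implementation: the feature names, weights, and layer sizes are made up, and training is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def wide_logit(active_crosses, wide_weights):
    """Memorization: sum the weights of the cross-product features that fired."""
    return sum(wide_weights.get(cross, 0.0) for cross in active_crosses)


def deep_logit(feature_ids, embeddings, hidden_w, out_w):
    """Generalization: concatenate embeddings, one ReLU layer, linear output."""
    concat = [x for f in feature_ids for x in embeddings[f]]
    hidden = [max(0.0, sum(w * x for w, x in zip(row, concat)))
              for row in hidden_w]
    return sum(w * h for w, h in zip(out_w, hidden))


def predict(feature_ids, active_crosses, params):
    # Both towers feed one logit, so joint training would send gradients
    # to the wide and deep parts through the same loss.
    logit = (wide_logit(active_crosses, params["wide"])
             + deep_logit(feature_ids, params["emb"],
                          params["hidden"], params["out"])
             + params["bias"])
    return sigmoid(logit)


# Hypothetical parameters for two categorical features.
params = {
    "wide": {("user=u1", "item=i9"): 2.0},   # one memorized exception
    "emb": {"user=u1": [0.1, -0.2], "item=i9": [0.3, 0.4]},
    "hidden": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    "out": [0.5, -0.5],
    "bias": 0.0,
}

features = ["user=u1", "item=i9"]
seen = predict(features, [("user=u1", "item=i9")], params)
unseen = predict(features, [], params)
print(seen, unseen)
```

Because the wide weights only need entries for the specific feature crosses the dense tower gets wrong, the memorizing table can stay small.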
Sources (5 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.
Wide & Deep architectures train a sparse cross-product tower and a dense embedding tower together, allowing the wide part to patch only the deep part's weaknesses. This joint approach requires smaller models than ensemble methods.
The Sleep paradigm uses Knowledge Seeding (distilling smaller networks into larger ones) and Dreaming (RL-generated rehearsal) to consolidate in-context knowledge into weights without forgetting. Gains appear in long-context understanding, few-shot reasoning, and continual learning.