INQUIRING LINE

How does in-weight memorization scale with model parameter count?

This explores whether the number of facts a model can store directly in its weights grows with parameter count, and what the corpus says about the ceiling, the measured rate, and the workarounds.


This explores in-weight memorization (facts a model stores inside its parameters rather than looking up externally) and whether that capacity simply scales with size. The corpus gives a surprisingly precise answer: it scales linearly, at a roughly fixed exchange rate. GPT-family models hold about 3.6 bits of memorized information per parameter, and that number behaves like an intrinsic property of the model rather than a quirk of how it was trained (When do language models stop memorizing and start generalizing?). So 'more parameters means more memory' is true, but only in the literal, bounded sense of bits per parameter, not in any open-ended way.
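To make the exchange rate concrete, here is a minimal arithmetic sketch. It assumes the roughly 3.6 bits-per-parameter figure holds as a constant at every size, which is a simplification; the function names and the example parameter counts are illustrative, not taken from the source.

```python
# Assumption: the ~3.6 bits/parameter figure quoted above, treated as a constant.
BITS_PER_PARAM = 3.6


def memorization_capacity_bits(num_params: int,
                               bits_per_param: float = BITS_PER_PARAM) -> float:
    """Linear capacity estimate: total memorized bits = parameters * bits per parameter."""
    return num_params * bits_per_param


def bits_to_megabytes(bits: float) -> float:
    """Convert bits to megabytes (8 bits per byte, 10**6 bytes per MB)."""
    return bits / 8 / 1_000_000


# Hypothetical model sizes, chosen only to show the linear scaling.
for n in (125_000_000, 1_500_000_000, 7_000_000_000):
    capacity_mb = bits_to_megabytes(memorization_capacity_bits(n))
    print(f"{n:>13,} params -> about {capacity_mb:,.0f} MB of memorized information")
```

Under this assumption a 125M-parameter model tops out at roughly 56 MB of memorized information, and doubling the parameter count doubles the budget, nothing more.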

The more interesting finding is what happens when that capacity fills up. Once a model has used its memorization budget, it doesn't just stop: it undergoes a phase transition into *grokking*, shifting from rote storage toward genuine generalization (When do language models stop memorizing and start generalizing?). In other words, the parameter count sets a memory ceiling, and pressing against that ceiling is what pushes a model to start abstracting instead of memorizing. Memorization and generalization aren't separate model types; they're two regimes the same network moves between as capacity saturates.
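The saturation point described above can be pictured with a toy threshold check. This is a cartoon of the idea rather than the measurement procedure used in the cited work: it assumes a fixed budget of about 3.6 bits per parameter and a hypothetical, uniform information cost per training example (`bits_per_example`), neither of which a real training run would satisfy exactly.

```python
import math

# Assumption: fixed budget of ~3.6 bits per parameter, as quoted in the text.
BITS_PER_PARAM = 3.6


def saturation_examples(num_params: int, bits_per_example: float) -> int:
    """Roughly how many examples fit in the budget before it is exhausted.

    `bits_per_example` is a hypothetical average information cost per example.
    """
    return math.floor(num_params * BITS_PER_PARAM / bits_per_example)


def regime(num_examples: int, num_params: int, bits_per_example: float) -> str:
    """Label which side of the capacity threshold a training set falls on."""
    dataset_bits = num_examples * bits_per_example
    capacity_bits = num_params * BITS_PER_PARAM
    return "memorizing" if dataset_bits <= capacity_bits else "saturated"


# Illustrative numbers only: a 1,000-parameter toy model, 36 bits per example.
print(regime(50, 1_000, 36.0))
print(regime(200, 1_000, 36.0))
```

The point of the sketch is only that the threshold moves linearly with parameter count; what the model does after crossing it (the shift toward generalization) is an empirical finding that no threshold check captures.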

Because the ceiling is real, a separate line of work argues the smarter move is to stop scaling parameters for facts at all. A formal proof shows in-weight factual recall is fundamentally bounded by model size, while *tool use* (letting the model call out to an external lookup) decouples recall from parameter count entirely, giving effectively unbounded facts through a simple circuit (Can models store unlimited facts without growing larger?). The same work flags the hidden cost of cramming facts in via fine-tuning: it overwrites prior knowledge and degrades general ability. That reframes the scaling question: the bottleneck isn't 'how big,' it's 'why store it in weights when you don't have to.'
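The contrast between bounded in-weight recall and tool use can be sketched with two toy classes. Both `InWeightStore` and `ToolAugmentedModel` are hypothetical names invented for this illustration; the sketch shows only the decoupling argument and is not the circuit construction or proof from the cited work.

```python
from typing import Callable, Dict, Optional


class InWeightStore:
    """Toy stand-in for in-weight recall: holds at most `capacity` facts."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.facts: Dict[str, str] = {}

    def write(self, key: str, value: str) -> bool:
        """Store a fact; refuse new keys once the fixed budget is spent."""
        if key not in self.facts and len(self.facts) >= self.capacity:
            return False
        self.facts[key] = value
        return True

    def recall(self, key: str) -> Optional[str]:
        return self.facts.get(key)


class ToolAugmentedModel:
    """Toy stand-in for tool use: recall is delegated to an external lookup."""

    def __init__(self, lookup: Callable[[str], Optional[str]]) -> None:
        self.lookup = lookup  # the "model" itself stores no facts

    def recall(self, key: str) -> Optional[str]:
        return self.lookup(key)


# The external table can grow without touching the model object at all.
external_db = {f"fact-{i}": f"value-{i}" for i in range(10_000)}
model = ToolAugmentedModel(external_db.get)
print(model.recall("fact-9999"))
```

In the first class, recall is limited by a number fixed at construction time, which plays the role of parameter count. In the second, adding facts means growing `external_db`, and the model's size never enters the picture.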

The corpus also tells you *where* in the network this memory lives, which matters for anyone who wants to edit or remove it. In GPT-Neo, memorized passages leave a localized fingerprint (larger gradients in lower layers and a specific low-layer attention head fixating on rare tokens), making memorization targetable rather than smeared across the whole model (Where does a model store memorized paragraphs?). So capacity isn't uniformly distributed; it concentrates, which is why a 3.6-bits-per-parameter average can coexist with very specific, surgically removable memories.

Two adjacent framings round this out. Recommender architectures faced this exact tension early: Wide & Deep models split memorization (a sparse cross-product tower) from generalization (dense embeddings) and train them jointly, so the memorizing half stays small because the generalizing half handles the common cases (Can one model handle both memorization and generalization?). And rather than buying capacity with parameters, the *Sleep* paradigm consolidates in-context knowledge into weights through offline distillation and rehearsal, adding memory without adding size or forgetting (Can models consolidate memories during offline sleep phases?). The throughline across all of these: parameter count buys you a fixed, measurable memory budget, and most of the recent ideas are about spending it more wisely rather than just buying more.
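The Wide & Deep split mentioned above can be sketched as a single forward pass in which a sparse cross-product term and a dense embedding term are summed into one logit. This is a simplified illustration in plain Python: the wide side here uses only cross-product features, the deep side is a single ReLU layer, and every weight, feature, and function name is a made-up placeholder rather than anything from a real recommender.

```python
import math
from typing import List, Sequence, Tuple


def cross_product(features: Sequence[int], pairs: Sequence[Tuple[int, int]]) -> List[float]:
    """Wide side: each cross feature is 1.0 only if both binary inputs in the pair are on."""
    return [float(features[i] and features[j]) for i, j in pairs]


def dense_forward(embedding: Sequence[float],
                  hidden_w: Sequence[Sequence[float]],
                  out_w: Sequence[float]) -> float:
    """Deep side: one ReLU hidden layer over a dense embedding, then a linear readout."""
    hidden = [max(0.0, sum(w * e for w, e in zip(row, embedding))) for row in hidden_w]
    return sum(w * h for w, h in zip(out_w, hidden))


def wide_and_deep_prob(features, pairs, wide_w, embedding, hidden_w, out_w, bias=0.0) -> float:
    """Sum the wide and deep logits, then apply a sigmoid to get one joint prediction."""
    wide_logit = sum(w * x for w, x in zip(wide_w, cross_product(features, pairs)))
    deep_logit = dense_forward(embedding, hidden_w, out_w)
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit + bias)))


# Placeholder inputs: three binary features, two crosses, a 2-d embedding.
p = wide_and_deep_prob(
    features=[1, 1, 0],
    pairs=[(0, 1), (1, 2)],
    wide_w=[2.0, -1.0],
    embedding=[0.5, -0.5],
    hidden_w=[[1.0, 0.0], [0.0, 1.0]],
    out_w=[1.0, 1.0],
)
print(round(p, 3))
```

Because both terms feed one logit, a joint training step would push gradients into both halves at once, which is what lets the wide weights specialize on the exceptions the dense side gets wrong.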


Sources (5 notes)

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

Can models store unlimited facts without growing larger?

A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.

Where does a model store memorized paragraphs?

Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.

Can one model handle both memorization and generalization?

Wide & Deep architectures train a sparse cross-product tower and a dense embedding tower together, allowing the wide part to patch only the deep part's weaknesses. This joint approach requires smaller models than ensemble methods.

Can models consolidate memories during offline sleep phases?

The Sleep paradigm uses Knowledge Seeding (distilling smaller networks into larger ones) and Dreaming (RL-generated rehearsal) to consolidate in-context knowledge into weights without forgetting. Gains appear in long-context understanding, few-shot reasoning, and continual learning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: does in-weight memorization scale predictably with model parameter count, and what does that scaling tell us about the tradeoff between storing facts in weights versus deferring to tools?

What a curated library found — and when (dated claims, not current truth): These findings span 2016–2026.
• In-weight memorization scales linearly at ~3.6 bits per parameter, behaving as a fixed physical property rather than a training artifact (2025).
• Once memorization capacity saturates, models undergo a phase transition into grokking — shifting from rote storage toward genuine generalization rather than simply failing (2024–2025).
• Tool use provably decouples factual recall from parameter count, offering effectively unbounded facts without parameter scaling (2025).
• Memorized passages localize to low-layer gradients and specific attention heads on rare tokens, making memories surgically targetable (2024).
• Sleep-like consolidation (offline distillation + rehearsal) can add memory capacity without increasing model size or inducing forgetting (2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.24832 (2025) — How much do language models memorize?
• arXiv:2508.20755 (2025) — Provable Benefits of In-Tool Learning for Large Language Models
• arXiv:2403.19851 (2024) — Localizing Paragraph Memorization in Language Models
• arXiv:2606.03979 (2026) — Language Models Need Sleep

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 3.6 bits/parameter claim, newer-scale models (2025–2026), longer-context training, and mixture-of-experts architectures may alter the constant; check whether the linear regime still holds and whether grokking/phase-transition framing survives at modern scales. Separately, verify whether tool-use decoupling remains practically undefeated or if in-weight retrieval has re-converged with external lookup under chain-of-thought or multi-agent orchestration.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months — particularly any showing nonlinear memorization scaling, or tool use introducing new bottlenecks (latency, hallucination, reasoning depth).
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Does compositional generalization (arXiv:2507.07207) erode the memorization–generalization phase boundary? (b) Can continual learning harnesses (e.g., Titans, Sleep) make the bits-per-parameter budget adaptive rather than fixed?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
