When do language models stop memorizing and start generalizing?
Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
The standard approach to measuring memorization — attempting to extract training data from the model — is fundamentally flawed. Language models can be coerced to output almost any string, so generation is not proof of memorization. Conversely, a model may memorize patterns (every other token, structural regularities) without reproducing text verbatim. Extraction is neither necessary nor sufficient.
The formal separation: unintended memorization is the information a model contains about a specific dataset (the bits that would change if a particular example were removed from training). Generalization is the information the model contains about the true data-generating process. Once the generalization component is isolated and subtracted out, what remains is a measurable total for memorization.
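The separation above can be sketched numerically. This is a minimal illustration, not the source's procedure: it assumes code length under a model can be approximated by negative log-likelihood in bits, and that a separate reference model stands in for the generalization component. The function names and the per-token probability inputs are hypothetical.

```python
import math


def code_length_bits(token_probs):
    """Bits needed to encode a sequence, given the probability a model
    assigned to each observed token (assumption: code length is
    approximated by negative log-likelihood)."""
    return -sum(math.log2(p) for p in token_probs)


def unintended_memorization_bits(probs_trained, probs_reference):
    """Bits saved on one example by having the trained model in addition
    to a reference model that only captures the general distribution.

    Assumption: the reference model is a stand-in for generalization.
    """
    bits_trained = code_length_bits(probs_trained)
    bits_reference = code_length_bits(probs_reference)
    # With both models available, an encoder can pick the shorter code.
    bits_with_both = min(bits_trained, bits_reference)
    # Whatever the trained model saves beyond the reference is attributed
    # to information about this specific example. Never negative.
    return bits_reference - bits_with_both


if __name__ == "__main__":
    # Hypothetical per-token probabilities for a two-token example.
    trained = [0.5, 0.5]
    reference = [0.25, 0.25]
    print(unintended_memorization_bits(trained, reference))
```

Taking the minimum means an example the trained model predicts worse than the reference contributes zero bits rather than a negative amount.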
The key empirical finding: GPT-family models have a capacity of approximately 3.6 bits per parameter for unintended memorization. Models memorize training data until this capacity fills. At that point a phase transition occurs: grokking begins, and unintended memorization decreases as the model starts to generalize.
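The arithmetic behind the capacity claim can be sketched as follows. This is a back-of-the-envelope illustration only: the 3.6 bits-per-parameter figure is the approximate value quoted above, `dataset_bits` is a hypothetical estimate of the information content of the training set (which would have to be measured separately), and the function names are invented for this example.

```python
# Approximate figure quoted in this note for GPT-family models.
BITS_PER_PARAM = 3.6


def capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Estimated unintended-memorization capacity of a model, in bits."""
    return n_params * bits_per_param


def fill_ratio(n_params, dataset_bits, bits_per_param=BITS_PER_PARAM):
    """Fraction of capacity the dataset would occupy if fully memorized.

    dataset_bits is an assumed, externally estimated quantity.
    """
    return dataset_bits / capacity_bits(n_params, bits_per_param)


def is_saturated(n_params, dataset_bits, bits_per_param=BITS_PER_PARAM):
    """True when the dataset holds at least as many bits as the model can
    memorize, the regime in which the note says generalization begins."""
    return fill_ratio(n_params, dataset_bits, bits_per_param) >= 1.0


if __name__ == "__main__":
    n_params = 1_000_000
    print(capacity_bits(n_params))            # capacity in bits
    print(is_saturated(n_params, 1_000_000))  # dataset smaller than capacity
    print(is_saturated(n_params, 4_000_000))  # dataset larger than capacity
```

Under these assumptions a one-million-parameter model has room for roughly 3.6 million bits, so a dataset carrying less information than that can be memorized without any pressure to generalize.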
This reframes the grokking phenomenon mechanistically. Building on the note "What happens inside models when they suddenly generalize?", which analyzes grokking in terms of three phases, the capacity-filling measurement adds the trigger condition: grokking doesn't begin at an arbitrary training step, it begins when memorization saturates. The three phases are downstream of a capacity constraint, not of training duration per se.
The practical implication: memorization capacity is a measurable property of a specific model, not a property of the training algorithm. Two models trained by the same algorithm on the same data can have different memorization properties. This matters for privacy (which models leak more), for understanding generalization (capacity constrains when it begins), and for the question raised in "Can AI pass every test while understanding nothing?": a model that appears to generalize may simply have unfilled memorization capacity.
Inquiring lines that use this note as a source 25
This note is a source for these synthesized inquiries.
- How does in-context learning trigger phase transitions in model behavior?
- Do grokking phases correspond to transitions between nesting levels?
- How much does memorization capacity limit a model's ability to learn new information?
- How does memorization capacity saturation trigger the grokking transition?
- Can we detect and measure circuit formation before generalization emerges?
- How do the three grokking phases connect to memorization capacity limits?
- Can data pruning strategies exploit the finite nature of memorization capacity?
- Do models with unfilled memorization capacity appear to generalize falsely?
- Why is extracting training data insufficient proof that models memorize?
- Why does grokking reveal the shift from memorization to genuine understanding?
- How do retention gates regularize forgetting across different sequence model architectures?
- Where does inference compute stop substituting for model capacity?
- How does modeling capability relate to lossless compression in language models?
- How do overparameterization and data size shift what attractors represent?
- What makes data augmentation an implicit form of contraction learning?
- What distinguishes data that generalizes broadly from task-specific memorization?
- Does grokking in modular arithmetic follow the same three-phase learning trajectory?
- What makes naive memory consolidation regress below having no memory at all?
- What makes a learned consolidation rule lossy and where does contamination enter?
- Can sparsity patterns reliably indicate how well a model knows its input?
- What capacity threshold determines whether RL teaches activation versus shortcut learning?
- How do models develop dense representations for familiar training data?
- How does in-weight memorization scale with model parameter count?
- What is the theoretical capacity limit before memorization saturates?
- What empirical evidence supports the Learning Law on real language models?
Related concepts in this collection 3
- What happens inside models when they suddenly generalize?
  Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
  capacity-filling provides the trigger mechanism for when grokking begins
- Can we predict keyword priming before learning happens?
  Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
  a complementary view of how memorization interacts with learning
- Can we prune training data without hurting model performance?
  This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste: if we can find which examples are truly necessary, we could train better models on far less data.
  if memorization has finite capacity, pruning removes low-value items that consume capacity
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- How much do language models memorize?
- Localizing Paragraph Memorization in Language Models
- Provable Benefits of In-Tool Learning for Large Language Models
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- Repeat After Me: Transformers are Better than State Space Models at Copying
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- How new data permeates LLM knowledge and how to dilute it
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Original note title
llm memorization formally separates into unintended memorization and generalization — 3.6 bits-per-parameter capacity fills before grokking begins