INQUIRING LINE

How do the three grokking phases connect to memorization capacity limits?

This explores whether the three phases of grokking (the late, sudden jump from memorizing to generalizing) are actually triggered by a model running out of room to memorize — i.e., a hard capacity ceiling.


The corpus suggests they're not just connected: capacity saturation is the trigger. One line of work measures a concrete number: GPT-family models hold roughly 3.6 bits of memorization per parameter, and once that budget fills, the phase transition into grokking begins (see "When do language models stop memorizing and start generalizing?"). So memorization capacity isn't a side detail; it's the clock that decides when generalization can start.
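The bits-per-parameter figure turns into a concrete budget with one multiplication. The sketch below is a back-of-the-envelope illustration only: the 3.6 figure is the approximate value quoted above, while the function names, the example model size, and the idea of comparing the budget against a `dataset_bits` estimate are assumptions made for illustration. The note does not say how a dataset's information content should be measured.

```python
# Approximate figure quoted in the text for GPT-family models.
BITS_PER_PARAM = 3.6


def memorization_budget_bits(num_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated total memorization capacity of a model, in bits."""
    return num_params * bits_per_param


def budget_is_full(num_params: int, dataset_bits: float) -> bool:
    """True when a (hypothetical) estimate of the dataset's information
    content meets or exceeds the model's memorization budget."""
    return dataset_bits >= memorization_budget_bits(num_params)


if __name__ == "__main__":
    params = 1_000_000  # illustrative model size, not taken from the source
    budget = memorization_budget_bits(params)
    print(f"{params:,} parameters -> about {budget / 8 / 1e3:.0f} kB of memorization")
```

At this rate a one-million-parameter model has a budget of about 3.6 million bits, or roughly 450 kB. On the capacity account, data beyond that budget cannot be stored example by example.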

The mechanistic view fills in what happens across the three phases. A model first memorizes by building lookup-table-like circuits, then gradually grows generalizing circuits alongside them, and finally prunes the memorization machinery away (see "What happens inside models when they suddenly generalize?"). Externally this looks like a sudden flip, but internally it's continuous, and the thing that kicks it off is memorization capacity saturating. Put the two notes together and the story is tidy: the bits-per-parameter ceiling is the pressure, and the three-phase circuit reorganization is how the model relieves it. When there's no more room to store individual answers, the cheaper move is to learn the rule.

What's interesting is that 'capacity' here isn't only about storage room; it can also be about compute. A separate line argues the real bottleneck in long context isn't how much a model can hold but how much computation it takes to fold that information into its weights, and that performance keeps improving with more consolidation passes (see "Is long-context bottleneck really about memory or compute?"). That reframes the grokking transition: the slow middle phase, where generalizing circuits form, may be compute-limited work, not just a passive wait for capacity to fill.

The corpus also complicates the clean memorize-then-generalize binary. In chain-of-thought reasoning, models don't fully abandon memorization. They do both at once, with memorization, raw output probability, and noisy step-by-step reasoning operating as separable factors (see "What three separate factors drive chain-of-thought performance?"). And memorization itself isn't monolithic: token-level analysis finds local, mid-range, and long-range sources, with local (short-range) memorization accounting for up to 67% of reasoning errors (see "Where do memorization errors arise in chain-of-thought reasoning?"). So 'pruning memorization' in phase three is less a clean deletion and more a rebalancing of which memorization survives.

If you want the most striking adjacent finding: a single training example can push math accuracy from 36% to 73.6% and keep test accuracy climbing for 1,400 steps after training accuracy already hit 100% (see "Can a single training example unlock mathematical reasoning?"). That post-saturation generalization is grokking by another name: improvement continuing well past the point where there's nothing left to memorize, which is exactly what the capacity story predicts. Designs that deliberately split the two jobs, like joint memorization-plus-generalization training, show the same logic from the other direction: give each component its own room and neither has to crowd the other out (see "Can one model memorize and generalize better than two?").
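One way to make "post-saturation generalization" measurable is to count how many evaluation steps test accuracy keeps improving after training accuracy first hits its ceiling. The sketch below assumes two accuracy curves logged at the same steps. The function name, the `saturated_at` threshold, and the toy curves are hypothetical: the curves are synthetic and do not reproduce any result cited above.

```python
from typing import Optional, Sequence


def post_saturation_gain_steps(
    train_acc: Sequence[float],
    test_acc: Sequence[float],
    saturated_at: float = 1.0,
) -> Optional[int]:
    """Number of steps between training accuracy first reaching
    `saturated_at` and the best test accuracy seen from that point on.

    Returns None if training accuracy never saturates.
    """
    if len(train_acc) != len(test_acc):
        raise ValueError("curves must have the same length")

    # First step at which the training set is fully fit.
    saturation_step = next(
        (i for i, acc in enumerate(train_acc) if acc >= saturated_at), None
    )
    if saturation_step is None:
        return None

    # Offset of the first occurrence of the best test accuracy after saturation.
    tail = test_acc[saturation_step:]
    return max(range(len(tail)), key=lambda i: tail[i])


if __name__ == "__main__":
    # Synthetic toy curves, one entry per evaluation step; not real results.
    train = [0.5, 0.9, 1.0, 1.0, 1.0, 1.0]
    test = [0.2, 0.3, 0.35, 0.5, 0.7, 0.7]
    print(post_saturation_gain_steps(train, test))
```

A result well above zero is the signature described here: the model has nothing left to fit on the training set, yet held-out performance is still rising.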


Sources (7 notes)

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

What happens inside models when they suddenly generalize?

Models trained past overfitting generalize through three stages: memorization via lookup tables, gradual formation of generalizing circuits, then pruning of memorization components. Mechanistic analysis shows this appears discontinuous externally but progresses continuously, triggered by memorization capacity saturation.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.
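The joint-training idea in the last note can be sketched as a forward pass in which a sparse "wide" linear model over cross-product features and a "deep" embedding model feed a single sigmoid. This is a simplified illustration, not the original architecture: the deep part here is just averaged embeddings with a ReLU and a linear output instead of a stack of hidden layers, and all function names, feature names, and weights are made up.

```python
import math
from itertools import combinations


def cross_product_features(active: set) -> set:
    """Wide input: all pairwise AND-crosses of the active categorical features."""
    return {f"{a}&{b}" for a, b in combinations(sorted(active), 2)}


def wide_logit(active: set, wide_weights: dict) -> float:
    """Sparse linear model over crosses; crosses with no learned weight contribute 0."""
    return sum(wide_weights.get(c, 0.0) for c in cross_product_features(active))


def deep_logit(active: set, embeddings: dict, out_weights: list) -> float:
    """Average the feature embeddings, apply ReLU, then a linear output layer."""
    dim = len(out_weights)
    pooled = [0.0] * dim
    for name in active:
        vec = embeddings.get(name, [0.0] * dim)
        pooled = [p + v / len(active) for p, v in zip(pooled, vec)]
    hidden = [max(0.0, p) for p in pooled]
    return sum(h * w for h, w in zip(hidden, out_weights))


def predict(active, wide_weights, embeddings, out_weights, bias=0.0) -> float:
    """Joint prediction: one sigmoid over the summed wide and deep logits."""
    z = wide_logit(active, wide_weights) + deep_logit(active, embeddings, out_weights) + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical toy weights: one memorized cross plus two small embeddings.
    wide_w = {"a&b": 2.0}
    emb = {"a": [1.0, -1.0], "b": [1.0, -1.0]}
    print(predict({"a", "b"}, wide_w, emb, out_weights=[1.0, 1.0]))
```

Because both logits pass through the same sigmoid, a loss on the output would send gradients to both parts at once. That shared objective is what lets the wide part stay small and cover only the cases the deep part handles poorly, whereas an ensemble trains each model separately.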

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question: do grokking's three phases (memorization → mixed → generalization) actually reflect saturation of a model's memorization capacity, or are there independent drivers?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026, treating capacity saturation as grokking's trigger:
• Models hold ~3.6 bits of memorization per parameter; once that budget fills, phase transition begins (2025).
• Phase three ('pruning') is less deletion, more rebalancing: memorization persists in chain-of-thought as separable factors—output probability, step-by-step reasoning, memorization all operate concurrently (2024).
• Token-level memorization has three sources (local, mid-range, long-range); short-range copying dominates reasoning errors (2025).
• With a single training example, test accuracy keeps climbing for 1,400 steps after training accuracy reaches 100% (2025).
• The real bottleneck may be compute-to-consolidate context into weights, not storage capacity alone; performance improves with more consolidation passes (2025).
• Joint memorization-plus-generalization architectures show both components need separate capacity; neither crowds the other when room is given (2016, revived 2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.24832 (2025-05) — How much do language models memorize?
• arXiv:2407.01687 (2024-07) — Deciphering Factors in Chain-of-Thought: Probability, Memorization.
• arXiv:2508.02037 (2025-08) — Diagnosing Memorization in CoT, One Token at a Time.
• arXiv:2504.20571 (2025-04) — Reinforcement Learning for Reasoning with One Training Example.

Your task:
(1) RE-TEST: For each claim (capacity ceiling, phase transitions, post-saturation learning), check whether 2026's largest models (or 2024–2025 scaling laws) show the 3.6-bit-per-param rule holding, breaking, or sliding. Does the phase three 'rebalancing' story survive token-level analysis in recent CoT work, or do newer models show cleaner pruning? Separate the durable question (does capacity pressure trigger structure change?) from perishable claims (the exact bits/param, the timing of transitions).
(2) Surface contradicting or superseding work from the last ~6 months: look for papers claiming memorization and generalization are NOT capacity-driven, or that don't appear in this path.
(3) Propose 2 research questions that assume the regime has shifted—e.g., what if compute consolidation, not capacity, is the primary throttle? What if token-level memorization sources have collapsed into one dominant pathway in newer models?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
