What capacity limits does a model's memory face as the corpus grows?
This explores 'memory' in the broad sense — how a model holds and recalls a growing body of facts or context — and asks where the ceiling is: is it the parameters, the context window, or something else entirely?
A model's ability to store and recall information breaks down as the corpus it must hold keeps growing, and memory turns out to have several different ceilings depending on *where* it lives. The most concrete limit is inside the weights themselves. One line of work measures a fixed memorization capacity of roughly 3.6 bits per parameter in GPT-family models; once that budget fills, memorization saturates and the model shifts toward generalization, a phase transition associated with grokking (When do language models stop memorizing and start generalizing?). So a model can hold only so many raw facts in-weight, and that number is set by its size, not by how long you train it.
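To make the in-weight budget concrete, here is a back-of-the-envelope sketch in Python. It takes the roughly 3.6 bits-per-parameter figure above at face value; the function names, the example model sizes, and the 64-bits-per-fact cost are illustrative assumptions, not values from the cited work.

```python
# Approximate figure quoted in the text; treat it as an estimate, not a constant of nature.
BITS_PER_PARAM = 3.6


def capacity_bits(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated raw memorization budget of a model, in bits."""
    return n_params * bits_per_param


def capacity_megabytes(n_params: int) -> float:
    """Same budget expressed in megabytes (1 MB = 10**6 bytes)."""
    return capacity_bits(n_params) / 8 / 1e6


def max_facts(n_params: int, bits_per_fact: float) -> int:
    """How many facts fit if each one costs `bits_per_fact` bits.

    `bits_per_fact` is a hypothetical per-fact cost chosen by the caller.
    """
    return int(capacity_bits(n_params) // bits_per_fact)


if __name__ == "__main__":
    # Example model sizes are arbitrary round numbers.
    for n in (125_000_000, 1_000_000_000, 7_000_000_000):
        print(f"{n:>13,} params -> about {capacity_megabytes(n):,.1f} MB of memorized content")
```

Under these assumptions a one-billion-parameter model has a budget of about 450 MB, which is small next to any corpus that keeps growing. The budget scales linearly with parameter count and with nothing else in this sketch.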
That ceiling is exactly why external memory matters. A formal result shows that in-weight factual recall is bounded by parameter count, but tool use, which lets the model look facts up rather than store them, decouples recall from model size and makes it effectively unbounded (Can models store unlimited facts without growing larger?). The same work warns about the cost of trying to cram more in by fine-tuning: writing new facts into the weights overwrites old knowledge and degrades general capability. Tools don't just dodge the limit; they expand what the model can reason over in the first place (Do tools actually expand what language models can reason about?).
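The decoupling can be illustrated with a toy sketch. Everything here is hypothetical: `FACT_STORE`, `lookup`, and `answer` are stand-ins invented for illustration, and a plain dictionary takes the place of whatever database or search API a real system would call. The point is only that adding facts changes the store, not the component that answers.

```python
from typing import Callable, Dict, Optional

# Hypothetical external fact store; a real system might use a database or search API.
FACT_STORE: Dict[str, str] = {
    "capital of France": "Paris",
}


def lookup(key: str) -> Optional[str]:
    """Tool: fetch a fact from the external store instead of from weights."""
    return FACT_STORE.get(key)


def answer(query: str, tool: Callable[[str], Optional[str]] = lookup) -> str:
    """Stand-in for a model that calls a tool rather than recalling in-weight."""
    fact = tool(query)
    return fact if fact is not None else "unknown"


# Adding a fact grows the store, not the "model": answer() itself is unchanged.
FACT_STORE["boiling point of water at 1 atm"] = "100 C"

if __name__ == "__main__":
    print(answer("capital of France"))
    print(answer("boiling point of water at 1 atm"))
    print(answer("population of Atlantis"))
```

Nothing in `answer` has to be retrained or resized when the store grows, which is the sense in which recall stops depending on model size. Overwriting of old knowledge, the failure mode of fine-tuning described above, also has no analogue here.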
If you instead push everything into a long context window, you hit a different wall. Counterintuitively, the bottleneck there isn't storage but the *compute* needed to digest evicted context into the model's internal state, which behaves like a test-time scaling problem where more consolidation passes buy better recall (Is long-context bottleneck really about memory or compute?). And even well under the nominal window size, performance erodes: on FLenQA, reasoning accuracy drops from 92% to 68% with just 3,000 tokens of padding, far below any capacity limit (Does reasoning ability actually degrade with longer inputs?). Long context can stand in for retrieval on semantic lookups, but it collapses on structured, relational queries that require joining facts; length alone doesn't give you a database (Can long-context LLMs replace retrieval-augmented generation systems?).
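The difference between a semantic lookup and a relational query is easier to see in code. The sketch below uses Python's built-in `sqlite3` module with a made-up two-table schema (`authors`, `papers`) and invented rows; it is not drawn from the LOFT benchmark. The first query is answered by a single row, while the second needs a join plus an aggregate across rows.

```python
import sqlite3

# In-memory database with a small, invented schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE papers  (id INTEGER PRIMARY KEY, author_id INTEGER, year INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO papers VALUES
        (1, 1, 2021, 'Memory limits'),
        (2, 2, 2023, 'Long context'),
        (3, 1, 2023, 'Retrieval');
    """
)

# Lookup-style question: one row contains the whole answer.
single = conn.execute("SELECT title FROM papers WHERE id = 2").fetchone()

# Relational question: "which authors have more than one paper?"
# Answering it requires joining two tables and aggregating over rows.
rows = conn.execute(
    """
    SELECT a.name, COUNT(*) AS n_papers
    FROM papers AS p
    JOIN authors AS a ON a.id = p.author_id
    GROUP BY a.name
    HAVING COUNT(*) > 1
    ORDER BY a.name
    """
).fetchall()

if __name__ == "__main__":
    print(single)
    print(rows)
```

A model reading the same rows as flat text has to perform the join and the count implicitly, which is the kind of operation the text above says long context does not supply on its own.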
The most interesting answer to 'what happens as the corpus grows' is architectural: stop treating memory as one undifferentiated pool. Neural memory modules such as Titans separate short-term attention from a compressed long-term store and selectively memorize only *surprising* tokens, which lets the system scale past two million tokens without the quadratic blowup of attention (Can neural memory modules scale language models beyond attention limits?). The throughline across all of these is that there is no single 'memory capacity': there is a parameter budget for facts, a compute budget for consolidating context, and a structural choice about what to store versus what to look up. The capacity question is really a question about which of those you're spending.
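A toy sketch can show the idea of surprise-gated storage and why quadratic attention becomes a problem at this scale. This is not the Titans implementation: here the "prediction" is a simple exponential moving average over scalar inputs, "surprise" is the absolute prediction error, and the names `surprise_gated_store` and `attention_pairs`, the decay, and the threshold are all assumptions made for illustration.

```python
from typing import List


def surprise_gated_store(values: List[float], threshold: float, decay: float = 0.5) -> List[int]:
    """Return the positions that would be written to long-term memory.

    Toy rule: keep a running prediction (exponential moving average) and
    store a position only when its value deviates from the prediction by
    more than `threshold`.
    """
    if not values:
        return []
    stored: List[int] = []
    prediction = values[0]
    for i, x in enumerate(values):
        surprise = abs(x - prediction)
        if surprise > threshold:
            stored.append(i)
        # Update the prediction after deciding whether to store.
        prediction = decay * prediction + (1.0 - decay) * x
    return stored


def attention_pairs(n_tokens: int) -> int:
    """Pairwise scores computed by full (unmasked) self-attention over n tokens."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    stream = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0]
    print(surprise_gated_store(stream, threshold=2.5))
    print(f"{attention_pairs(2_000_000):,} pairwise scores at 2M tokens")
```

In this sketch only the one outlier in the stream is stored, so memory writes grow with the number of surprising events instead of with sequence length. Full attention over two million tokens, by contrast, would involve 4 × 10^12 pairwise scores per layer and head before any masking.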
Sources (7 notes)
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
A formal proof and experiments show in-weight memorization is bounded by model size, while tool-use enables unbounded factual recall through a simple circuit. In-weight finetuning also degrades general capability by overwriting prior knowledge.
Formal proof shows tool-integrated reasoning enables strategies impossible or prohibitively verbose in text alone, expanding both empirical and feasible support. The advantage spans abstract reasoning, not just arithmetic, and Advantage Shaping Policy Optimization stabilizes training without reward distortion.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.