Has memory architecture replaced parameter count as the scaling frontier?
Late-2025 research suggests the field's next major efficiency gains will come from restructuring how models store and use experience rather than from simply making them larger. Three signals from that research converge on the same conclusion: parameter count has stopped being the most useful axis to scale, and memory architecture has taken its place.
Signal one: the field can finally taxonomize itself. Two major surveys, "Memory in the Age of AI Agents" and "AI Hippocampus", appeared within months of each other and propose orthogonal but compatible three-axis taxonomies: forms × functions × dynamics, and implicit × explicit × agentic. Surveys taxonomize after the fact; two arriving this close together suggests the design space has matured to the point where comparing systems requires a shared vocabulary. A field develops that need only when architecture is the primary variable being designed.
Signal two: memory and compute scale together, not separately. ReasoningBank's MaTTS (memory-aware test-time scaling) finding shows that test-time scaling generates contrastive signals, which improve memory, which in turn guides future scaling: a compounding loop. This makes memory-driven experience scaling a new scaling law rather than a multiplier on existing ones. Parameter scaling laws (Kaplan et al., Chinchilla) predict loss as a function of model size, data, and compute; MaTTS suggests an additional term: cumulative interaction history processed into structured memory.
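To make the "additional term" concrete, here is one way it could be written down. The first line is the standard Chinchilla-style parametric fit. The second line is only an illustrative assumption: the symbol M, the constants C and γ, and the choice of a separable additive term are hypothetical and are not taken from the ReasoningBank paper.

```latex
% Chinchilla-style parametric fit:
%   N = parameter count, D = training tokens, E = irreducible loss,
%   A, B, \alpha, \beta = fitted constants
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Hypothetical extension (an assumption for illustration only):
%   M = amount of interaction history distilled into structured memory
%   C, \gamma = illustrative constants, not fitted to any data
L(N, D, M) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \frac{C}{M^{\gamma}}
```

A separable term is the simplest possible form. If the loop between test-time compute and memory really does compound, an interaction term between compute and M may describe it better than an independent additive one.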
Signal three: sparsity is multi-dimensional. Engram's U-shaped scaling law shows that conditional memory and conditional computation are complementary sparsity axes: pure MoE underperforms hybrid MoE+lookup at matched parameter count and matched FLOPs. The largest gains appear in reasoning, not retrieval, because separating local lookup from global integration frees attention for composition. Parameters distributed across memory and computation outperform parameters concentrated in either alone.
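The shape of that claim can be illustrated with a toy model. The sketch below is not Engram's fitted law: the functional form, the constants, and the names `toy_loss`, `sweep`, and `best_allocation` are all assumptions made for illustration. It only shows why a fixed budget split between two axes with diminishing returns produces a U-shaped curve with an interior optimum.

```python
# Toy illustration of a U-shaped allocation curve.
# Everything here is hypothetical: the loss form and constants are NOT
# taken from the Engram paper, they only mimic diminishing returns.

def toy_loss(rho, a=0.3, b=1.0, alpha=0.5, beta=0.5, eps=0.05, floor=1.5):
    """Loss when a fraction `rho` of a fixed sparse budget goes to lookup
    memory and the remaining `1 - rho` goes to MoE experts."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    memory_term = a * (eps + rho) ** -alpha        # shrinks as memory grows
    compute_term = b * (eps + 1.0 - rho) ** -beta  # shrinks as experts grow
    return floor + memory_term + compute_term


def sweep(steps=21):
    """Evaluate the toy loss on an evenly spaced grid of allocations."""
    return [(i / (steps - 1), toy_loss(i / (steps - 1))) for i in range(steps)]


def best_allocation(steps=21):
    """Return the (rho, loss) pair with the lowest loss on the grid."""
    return min(sweep(steps), key=lambda pair: pair[1])


if __name__ == "__main__":
    for rho, loss in sweep(11):
        print(f"rho={rho:.1f}  loss={loss:.3f}")
    print("best:", best_allocation())
```

In this toy, both pure allocations (`rho = 0`, all experts, and `rho = 1`, all lookup) score worse than a mixed split, which is the qualitative pattern the note attributes to hybrid MoE+lookup. Where the minimum sits depends entirely on the assumed constants.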
The convergent story: returns from adding parameters are diminishing along a known curve; returns from restructuring memory are still in their early steep phase. This does not mean parameters stop mattering. It means the marginal next-generation improvement is more likely to come from architectural restructuring of memory than from another order of magnitude in size.
The counter-evidence, and why it sharpens rather than undermines the take: "Useful Memories Become Faulty" demonstrates that naive consolidation can regress below the no-memory baseline. This is what one would expect if memory architecture is the bottleneck: the design choices in how memory is maintained matter more than whether memory exists at all. The fragility is itself evidence that memory is the active variable. Parameter-count scaling is not brittle in the same way: adding parameters rarely makes a model worse, whereas adding consolidation can.
The writing angle: the prior scaling-law era was about pretraining compute. The current era is about memory structures that determine how experience gets converted into improved behavior, and that conversion mechanism is now the design problem.
Inquiring lines that use this note as a source (11)
This note is a source for these synthesized inquiries.
- Does architectural discovery follow an empirical scaling law like neural networks?
- Why do scaling laws fail to predict optimal architectures at small parameter counts?
- What architectural changes would accelerate the cleanup phase?
- How do conditional scaling laws incorporate hardware into architecture choices?
- Why do scaling laws show capability saturation at specific thresholds?
- Can depth scaling and breadth scaling unlock independent capability axes?
- Why do production teams choose expensive frontier models over fine-tuning?
- Can memory and test-time compute scale together as a single axis?
- Why do short interaction benchmarks fail to predict long horizon performance?
- Can zero-weight drift through external memory replace parameter plasticity entirely?
- How do the three-axis taxonomies of memory forms and functions differ?
Related concepts in this collection (7)
- Can three axes replace the short-term long-term memory split?
  Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.
  taxonomy signal
- Can agents learn better from their failures than successes?
  Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
  MaTTS as new scaling axis
- Can lookup memory and computation work together better than either alone?
  Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?
  Engram U-curve
- Does agent memory degrade when continuously consolidated?
  Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
  fragility as evidence that memory is the active variable
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  architectural memory restructuring for working layer
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  Titans/Miras as memory-architecture shift
- Is agent memory a storage problem or a connectivity problem?
  Most systems treat memory as a repository to store and retrieve. But what if memory's real usefulness depends on how units are linked together rather than what is stored?
  extends: connectivity-not-storage specifies which memory design choice the scaling-dimension thesis depends on
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- AlphaGo Moment for Model Architecture Discovery
- Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
- The Serial Scaling Hypothesis
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- Provable Benefits of In-Tool Learning for Large Language Models
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- OMNI-SIMPLEMEM: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
Original note title
memory architecture is the new scaling dimension — taxonomy surveys plus MaTTS plus Engram U-curve suggest memory has overtaken parameter count