SYNTHESIS NOTE

Do memory systems actually help language models learn continuously?

When you subtract what a model already knows, do dedicated memory architectures genuinely enable continual learning, or do they mainly inherit base capability? CL-BENCH isolates learning from prior skill to test this.

Synthesis note · 2026-06-27 · sourced from Autonomous Agents

CL-BENCH is built to ask a question the field has been assuming away: do LLM systems actually improve with sequential experience, beyond what their base capability already gives them? Six expert-validated domains (software engineering, signal processing, outbreak forecasting, database querying, strategic games, demand forecasting) are designed so that tasks share a learnable latent structure, such as a codebase layout, opponent strategies, or disease dynamics, that a stateful system can discover online but a stateless one cannot. The decisive design choice is a gain metric that subtracts off prior capability, separating genuine learning from simply being good already. The result is uncomfortable: naive in-context learning outperforms dedicated memory systems on most tasks, and the best system manages only 25.4% normalized gain over its stateless baseline. Memory modules introduce spurious generalizations and stale beliefs; accumulated state frequently hurts rather than helps.
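To make the idea of a capability-subtracting gain metric concrete, here is a minimal sketch. The note does not give CL-BENCH's exact formula, so the definition below is an assumption: gain is taken as the share of the stateless baseline's remaining headroom that the stateful system recovers. The function name `normalized_gain` and the `ceiling` parameter are hypothetical.

```python
from statistics import mean


def normalized_gain(stateful_scores, stateless_scores, ceiling=1.0):
    """Assumed gain definition, not CL-BENCH's published formula.

    Returns the fraction of the stateless baseline's headroom that the
    stateful system recovers. Negative values mean accumulated state hurt.
    """
    if len(stateful_scores) != len(stateless_scores):
        raise ValueError("score lists must be aligned task by task")
    base = mean(stateless_scores)
    headroom = ceiling - base
    if headroom <= 0:
        return 0.0  # baseline already at the ceiling: nothing left to learn
    return (mean(stateful_scores) - base) / headroom


# A strong base model with a high raw score but little learning...
strong = normalized_gain([0.92, 0.92], [0.90, 0.90])
# ...versus a weaker base model that learns much more from experience.
weak = normalized_gain([0.70, 0.70], [0.40, 0.40])
print(f"strong-base gain: {strong:.2f}, weak-base gain: {weak:.2f}")
```

Under this assumed definition the strong-base system scores higher in raw terms (0.92 versus 0.70) yet shows a gain of about 0.2, against about 0.5 for the weaker one. That is the confound a gain metric is meant to remove: raw scores reward prior capability, while the gain rewards only what was learned online.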

This is the empirical hammer the optimistic memory literature needs. It complicates two related notes, "Can agents learn continuously from experience without updating weights?" and "Can frozen language models continually improve through memory structure alone?". Both report continual gains from textual memory, but CL-BENCH's gain metric suggests their headline numbers may partly reflect base capability rather than learning, and that the advantage is fragile across domains. The mechanism it exposes converges with "Does agent memory degrade when continuously consolidated?": writing more state is not monotonically good, because the consolidation/retrieval step itself is the failure point.

The harness implication is sharp. If the skill/memory lifecycle that powers self-evolving agents can drop below a stateless baseline, then the bottleneck is not adding memory but controlling what gets written and trusted. The strongest counterargument is benchmark coverage: six domains and a single gain metric may understate cases where state compounds over much longer horizons than CL-BENCH evaluates, and ICL's edge may erode once interaction histories exceed the context window, which is precisely the regime persistent memory is meant to survive.
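One way to picture "controlling what gets written and trusted" is a write gate in front of the memory store. The sketch below is purely illustrative and is not a mechanism described by CL-BENCH or the linked notes: the class `GatedMemory`, its `min_support` threshold, and the confirm-or-evict rule are all assumptions chosen to show the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class GatedMemory:
    """Hypothetical store: a claim is trusted only after repeated confirmation."""

    min_support: int = 2  # assumed threshold; would need tuning per domain
    support: dict = field(default_factory=dict)

    def observe(self, claim: str, confirmed: bool) -> None:
        # A contradiction evicts the claim outright, guarding against stale beliefs.
        if not confirmed:
            self.support.pop(claim, None)
            return
        self.support[claim] = self.support.get(claim, 0) + 1

    def trusted(self) -> list:
        # Only claims confirmed at least min_support times are exposed to the agent,
        # which limits spurious generalizations drawn from a single episode.
        return sorted(c for c, n in self.support.items() if n >= self.min_support)


memory = GatedMemory()
memory.observe("tests live under tests/unit", confirmed=True)
after_one = memory.trusted()    # still empty: one sighting is not enough
memory.observe("tests live under tests/unit", confirmed=True)
after_two = memory.trusted()    # now trusted
memory.observe("tests live under tests/unit", confirmed=False)
after_stale = memory.trusted()  # evicted once contradicted
```

The design choice here is to make writing cheap but trusting expensive. Whether a gate like this would actually lift a memory system above its stateless baseline is exactly the kind of claim a gain metric would have to test.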

Original note title

dedicated memory systems lose to naive in-context learning on genuine continual-learning tasks — accumulated state hurts more than it helps