Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments
Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-BENCH), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-BENCH spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, and introduce a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this. In fact, naive ICL outperforms systems dedicated to memory management.
Introduction. Building LLM systems that improve through sequential experience (continual learning) has attracted substantial interest from researchers and practitioners alike. Recent work focuses on developing memory-based and adaptive AI systems intended to operate over long time horizons: software engineering agents that become more effective within a codebase over weeks of interaction, data science agents that learn from repeated interaction with the same datasets, and decision-support agents that refine predictions using ongoing feedback. These systems commonly incorporate memory retrieval modules [27, 8, 39], context compaction methods [12], and test-time training objectives and architectures [30, 29, 18, 31, 43, 21]. Yet existing evaluation protocols only partially capture this form of continual learning.
Discussion / Conclusion. We introduce CL-BENCH, the first difficult, expert-validated benchmark for measuring whether AI systems genuinely improve through sequential experience. Spanning six real-world domains with verifiable rewards and analysis metrics that isolate learning from underlying model capability, CL-BENCH enables rigorous comparison of continual learning strategies at frontier scale, which is increasingly important as AI agents are deployed in online settings. We expect CL-BENCH to grow with community contributions in varied and more difficult domains. To enable this, we define clear criteria for what proposed tasks in a high-quality continual learning benchmark should look like. Our evaluation reveals a gap in current systems’ ability to continually learn. Naive ICL outperforms dedicated memory architectures on most tasks, and even the best system achieves only 25.4% normalized gain over its stateless baseline. Accumulated state frequently hurts rather than helps: memory modules introduce spurious generalizations and stale beliefs, while more expensive systems fail to translate cost into performance.
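To make the idea of a normalized gain over a stateless baseline concrete, the following is a minimal sketch of one plausible formulation. This excerpt does not give the exact definition of CL-BENCH's metric, so the formula (improvement divided by the headroom left above the stateless score), the function name `normalized_gain`, and the `max_score` parameter are all assumptions made for illustration.

```python
def normalized_gain(stateful_score: float,
                    stateless_score: float,
                    max_score: float = 1.0) -> float:
    """Hypothetical normalized gain (an assumed form, not the paper's definition).

    Measures how much of the headroom above the stateless baseline
    the stateful system recovers. Negative values mean accumulated
    state hurt performance.
    """
    headroom = max_score - stateless_score
    if headroom <= 0:
        # The baseline is already at the ceiling, so gain is undefined.
        raise ValueError("stateless_score must be below max_score")
    return (stateful_score - stateless_score) / headroom


# Illustrative numbers only (not results from the paper).
improved = normalized_gain(stateful_score=0.6, stateless_score=0.5)
degraded = normalized_gain(stateful_score=0.4, stateless_score=0.5)
```

Under this assumed form, a system that matches its stateless baseline scores zero regardless of how capable the underlying model is, which is the sense in which such a metric separates learning from prior capability.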