It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

Paper · arXiv 2504.13173 · Published April 17, 2025


Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observe that most existing sequence models use either (1) dot-product similarity or (2) ℓ2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations, along with effective approximations that stabilize their training. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, which yields a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework for designing deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm.
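To make the four Miras choices concrete, here is a minimal sketch in plain Python. It is not the paper's implementation, and every name in it (`memory_step`, `grad_l2`, `grad_dot`, `lr`, `retain`) is hypothetical. It assumes (i) a single matrix as the associative memory, (ii) either an ℓ2 regression loss or a negative dot-product loss as the attentional bias, (iii) a scalar decay on the previous memory as the retention gate, and (iv) one step of online gradient descent per token as the memory learning algorithm.

```python
# Hypothetical sketch: a matrix-valued associative memory M that maps keys to
# values and is updated one (key, value) pair at a time.

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def grad_l2(M, k, v):
    """Gradient of 0.5 * ||M k - v||^2 with respect to M: (M k - v) k^T."""
    err = [p - t for p, t in zip(matvec(M, k), v)]
    return outer(err, k)

def grad_dot(M, k, v):
    """Gradient of -<M k, v> with respect to M: -v k^T."""
    return outer([-t for t in v], k)

def memory_step(M, k, v, grad_fn, lr=1.0, retain=1.0):
    """One online gradient step; `retain` in [0, 1] shrinks the old memory."""
    G = grad_fn(M, k, v)
    return [[retain * m - lr * g for m, g in zip(m_row, g_row)]
            for m_row, g_row in zip(M, G)]

def memorize(pairs, dim_v, dim_k, grad_fn, lr=1.0, retain=1.0):
    """Run the memory over a sequence of (key, value) pairs, starting from zero."""
    M = [[0.0] * dim_k for _ in range(dim_v)]
    for k, v in pairs:
        M = memory_step(M, k, v, grad_fn, lr, retain)
    return M

# Two orthonormal keys, each paired with a value to memorize.
pairs = [([1.0, 0.0], [2.0, -1.0]), ([0.0, 1.0], [0.5, 3.0])]

M_l2 = memorize(pairs, dim_v=2, dim_k=2, grad_fn=grad_l2)
recalled = matvec(M_l2, [1.0, 0.0])      # query with the first key

M_decay = memorize(pairs, dim_v=2, dim_k=2, grad_fn=grad_dot, retain=0.5)
faded = matvec(M_decay, [1.0, 0.0])      # older association is attenuated
```

With orthonormal keys and no decay, querying with the first key returns its stored value exactly. With `retain=0.5`, the older association is halved by the time the second pair is written, which is the forgetting behaviour the retention gate controls. Swapping `grad_fn` changes the attentional bias without touching the rest of the loop.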

Introduction. Designing efficient architectural backbones for sequence modeling is key to enhancing the capability of foundation models in domains ranging from language (Behrouz et al. 2024c; Vaswani et al. 2017a) and computer vision (Dosovitskiy et al. 2020) to computational biology (Wang et al. 2024) and neuroscience (Behrouz et al. 2024a). While Transformers (Vaswani et al. 2017a) have been firmly established as state-of-the-art (SOTA) models in sequence modeling, mainly due to their in-context learning and ability to learn at scale (Kaplan et al. 2020), their quadratic time and space complexity limits their applicability in tasks that require long-context modeling (Dalal et al. 2025; Li et al. 2024a; Liu et al. 2024b). Recent efforts aim to overcome these limitations by designing efficient recurrent alternatives (Behrouz et al. 2024c; Neil et al. 2017; Smith et al. 2022). Unlike the Transformer’s linearly growing memory (i.e., the KV cache), these models compress the context into a fixed-size memory, and therefore need better memory management to reach comparable performance.
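The contrast between a growing KV cache and a fixed-size memory can be illustrated with a rough count of stored numbers. This is a back-of-the-envelope sketch rather than a measurement: the function names are hypothetical, and it assumes a single layer and head, one key and one value vector of width `dim` per token, and a `dim` × `dim` matrix as the recurrent memory.

```python
def kv_cache_floats(num_tokens, dim):
    """Numbers held by a KV cache: one key and one value vector per token."""
    return 2 * num_tokens * dim

def fixed_memory_floats(dim):
    """Numbers held by a dim x dim matrix memory, independent of context length."""
    return dim * dim

dim = 64
for num_tokens in (32, 1024, 32768):
    print(num_tokens, kv_cache_floats(num_tokens, dim), fixed_memory_floats(dim))
```

Under these assumptions the cache overtakes the matrix memory once the context exceeds `dim / 2` tokens and keeps growing linearly, while the recurrent state stays the same size and must decide what to keep.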

Discussion / Conclusion. In this paper, we present Miras, a general framework that explains the connection between online optimization and test-time memorization. The Miras framework can explain the role of several standard architectural choices in the literature (e.g., the forget gate) and helps design the next generation of architectures that manage memory better. Building upon our framework, we present three novel sequence models, each with its own advantages and disadvantages. Our experimental evaluations show that all of these variants are more powerful than Transformers and linear RNNs on various downstream tasks. Miras admits a diverse set of such variants, and exploring these alternative architectures for different downstream tasks is an interesting future direction.

