INQUIRING LINE

Transformers remember what matches your query — but what if surprise or repetition are smarter memory triggers?

What attentional bias objectives compete with dot product similarity for associative memory?

This explores what alternatives to plain query-key dot product decide what an attention-style associative memory stores and retrieves — surprise, repetition-prominence, or learned similarity functions — and how those objectives stack up.

Standard transformer attention is, at bottom, a dot-product associative memory: a query is matched against keys by inner product, and the values whose keys score highest are weighted most heavily in the output. Read together, several notes in the corpus show different objectives competing to govern that matching, and they do not all optimize for the same thing.
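To make the baseline concrete, here is a minimal sketch of dot-product retrieval in plain Python. The helper names (`attend`, `softmax`, `dot`) and the toy keys and values are illustrative assumptions, not taken from any of the notes; the scaling by the square root of the dimension follows the standard scaled dot-product formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Soft retrieval: weight each value by softmax(q . k / sqrt(d))."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy memory (hypothetical data): three keys, each storing a one-dimensional value.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]

# The query points along the first key, so that key's value dominates the output.
weights, output = attend([4.0, 0.0], keys, values)
print(weights, output)
```

The only objective here is similarity: nothing about novelty, position, or repetition enters the score directly, which is the assumption the competing objectives below relax.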

The first competing objective is **surprise** rather than similarity. The Titans architecture (see the note "Can neural memory modules scale language models beyond attention limits?") splits short-term attention (quadratic, dot-product based) from a long-term neural memory that prioritizes *surprising* tokens for storage. Instead of asking "what is most similar to my query," the memory asks "what violated my expectations enough to be worth keeping," which is a gradient-based surprise signal rather than a dot product. That is a fundamentally different write objective, and it is what lets the model scale to contexts of 2M+ tokens without paying attention's quadratic cost.
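A surprise-driven write rule can be sketched as follows. This is a deliberately simplified illustration, not the Titans implementation: it assumes a linear memory matrix, a squared-error loss, and a plain gradient step, and the function name `surprise_write` and the learning rate are hypothetical. The published architecture includes further mechanisms that this sketch omits.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def surprise_write(memory, key, value, lr=0.5):
    """One gradient step on 0.5 * ||M k - v||^2 (assumed loss).

    Returns the updated memory and the gradient norm, used here as the
    surprise score: a large gradient means the pair was poorly predicted.
    """
    prediction = [dot(row, key) for row in memory]
    error = [p - v for p, v in zip(prediction, value)]
    grad = [[e * k for k in key] for e in error]  # outer product of error and key
    surprise = math.sqrt(sum(g * g for row in grad for g in row))
    updated = [
        [m - lr * g for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(memory, grad)
    ]
    return updated, surprise

memory = [[0.0, 0.0], [0.0, 0.0]]

# The same association presented twice: the second write is less surprising.
memory, first = surprise_write(memory, [1.0, 0.0], [2.0, 0.0])
memory, repeat = surprise_write(memory, [1.0, 0.0], [2.0, 0.0])

# An association the memory has never seen is surprising again.
memory, novel = surprise_write(memory, [0.0, 1.0], [0.0, 3.0])
print(first, repeat, novel)
```

Under these assumptions the write strength depends on prediction error, not on how well the key matches any query, which is the sense in which surprise competes with similarity.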

The second is a **structural prominence bias** baked into soft attention itself. The note on attention's bias toward repeated content ("Does transformer attention architecture inherently favor repeated content?") shows that soft attention does not weight purely by relevance: it systematically over-weights tokens that are repeated or context-prominent, creating a feedback loop that amplifies framing regardless of whether it answers the query. So even within dot-product attention, a hidden objective (prominence) rides alongside similarity, which is part of why sycophancy emerges. "System 2 Attention," which regenerates the context to strip irrelevant material, is essentially an attempt to subtract that competing bias out.
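One simple way repetition can translate into attention mass is through softmax normalization over positions: each copy of a token contributes its own term to the sum. The sketch below illustrates only that arithmetic, with made-up scores; it is not a reproduction of the note's analysis, and `attention_mass` is a hypothetical helper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mass(relevant_score, distractor_score, copies):
    """Attention on one relevant token vs. the total on `copies` repeated tokens."""
    weights = softmax([relevant_score] + [distractor_score] * copies)
    return weights[0], sum(weights[1:])

# The relevant token always has the higher individual score (2.0 vs. 1.0).
relevant_once, repeated_once = attention_mass(2.0, 1.0, copies=1)
relevant_many, repeated_many = attention_mass(2.0, 1.0, copies=5)

print(relevant_once, repeated_once)
print(relevant_many, repeated_many)
```

With a single copy the relevant token receives most of the weight; with five copies of the lower-scoring token, the repeated content collectively receives more, even though no individual score changed.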

The third is the **learned-similarity-function** contest, and here the corpus is unusually direct: dot product wins. Two notes on Rendle et al. ("Can MLPs learn to match dot product similarity in practice?" and "Why does dot product beat MLP-based similarity in practice?") show that, in collaborative filtering, replacing the inner product with an MLP that *learns* its own similarity metric underperforms a well-tuned dot product, even though an MLP is a universal approximator. The inductive bias of geometric similarity beats raw expressiveness, and the dot product also survives where the MLP cannot: of the two, only the dot product supports efficient maximum-inner-product search (MIPS) at scale. So the objective that competes hardest on paper, a freely learned similarity, loses on both accuracy and retrievability.
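The retrieval argument can be illustrated with an exact, brute-force maximum-inner-product search. This is a sketch under stated assumptions: the item names and embeddings are invented, and `top_k_inner_product` is a hypothetical helper. Production systems typically use approximate MIPS indexes rather than a full scan, which this example does not attempt to model.

```python
import heapq

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_inner_product(query, items, k):
    """Exact MIPS by full scan: return the k item names with the largest q . v."""
    scored = ((dot(query, vec), name) for name, vec in items.items())
    return [name for _, name in heapq.nlargest(k, scored)]

# Item embeddings (hypothetical) can be computed once and stored.
items = {
    "item_a": [1.0, 0.0],
    "item_b": [0.5, 0.5],
    "item_c": [-1.0, 0.0],
}

best = top_k_inner_product([1.0, 0.2], items, k=2)
print(best)
```

The design point is that the score factors into a stored item vector and a query vector, so the item side can be indexed ahead of time. An MLP that takes the query and item jointly as input has no such factorization, so each candidate would need its own forward pass.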

The through-line a curious reader might not expect: the dot product isn't winning because it's the most powerful matcher, but because the alternatives each trade something away. Surprise-based memory gives up similarity for compression and reach; prominence bias is an *unwanted* objective attention can't help but optimize; and learned MLP similarity gives up the geometric structure that makes retrieval tractable. Associative memory is less a search for the best similarity score than a negotiation among these competing pressures.

Sources 4 notes

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and MLP-scored items cannot be retrieved efficiently at scale, which suggests that inductive bias matters more than expressiveness in this setting.

Why does dot product beat MLP-based similarity in practice?

Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Titans: Learning to Memorize at Test Time (2.43 match, arXiv)
Neural Collaborative Filtering vs. Matrix Factorization Revisited (1.84 match, arXiv)
KAN: Kolmogorov-Arnold Networks (1.58 match, arXiv)
Curse of “Low” Dimensionality in Recommender Systems (1.57 match, arXiv)
Deep Interest Network for Click-Through Rate Prediction (1.52 match, arXiv)
Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities (1.52 match, arXiv)
On the Theoretical Limitations of Embedding-Based Retrieval (1.52 match, arXiv)
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (0.85 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about what objectives compete with dot-product similarity in associative memory. This remains an open question in efficient attention and long-context modeling.

What a curated library found — and when (findings span 2016–2026; treat as dated claims):
• Surprise-based memory (Titans, ~2025) prioritizes unexpected tokens over similarity for long-context storage, scaling past 2M tokens without quadratic cost.
• Softmax attention exhibits structural bias toward repeated/context-prominent content independent of relevance, amplifying sycophancy (System 2 Attention, ~2023).
• MLP-learned similarity functions underperform a well-tuned dot product on accuracy and cannot use efficient maximum-inner-product retrieval, despite universal approximation capacity (Rendle et al., ~2020).
• Dot product's inductive bias (geometric structure) survives where learned similarity trades off retrieval efficiency; this is not dominance through power but through constraint satisfaction.

Anchor papers (verify; mind their dates):
• arXiv:2501.00663 (Titans, 2025) — surprise-based long-term memory
• arXiv:2311.11829 (System 2 Attention, 2023) — prominence bias diagnosis
• arXiv:2305.13597 (Curse of "Low" Dimensionality, 2023) — learned similarity vs. dot product

Your task:
(1) RE-TEST EACH CONSTRAINT. For surprise-based memory, learned similarity, and prominence bias: have newer models (Llama 3.x, Claude 4, o1-preview) or training methods (DPO, PPO-derived tuning, retrieval-augmented training) since relaxed these trade-offs? Has efficient attention (FlashAttention 3, sparse methods) or caching/KV compression obsoleted the surprise objective's advantage? Does prominence bias still emerge in instruction-tuned models post-2025? Separate what remains a real limitation from what newer methods have absorbed.
(2) Surface the strongest contradicting or superseding work from the last ~6 months — any paper showing learned similarity *can* match dot product at scale, or prominence bias isn't a problem in practice, or surprise-based memory doesn't scale.
(3) Propose 2 research questions assuming the regime has moved: (a) If dot product's dominance rests on retrieval tractability rather than expressiveness, does hardware-accelerated learned similarity (e.g., GPU-native MLP kernels) change the calculus? (b) Do multi-objective losses (similarity + surprise + prominence correction) outperform single-objective attention in long-context or adversarial robustness settings?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
