INQUIRING LINE


Is linear attention really why some transformers run faster, or just one ingredient in a bigger efficiency recipe?

Does attention linearity alone explain the efficiency gains over standard transformers?

This explores whether making attention linear (so cost grows in step with sequence length instead of with its square) is itself the source of efficiency gains — or whether the corpus shows efficiency coming from several other places too.

The short answer from the corpus: linearity rarely travels alone. The systems that actually win on efficiency tend to pair it with other tricks, and several of the biggest gains come from places that have nothing to do with attention's math at all.

Start with the case that looks closest to "linearity is the answer." SpikingBrain converts an existing Qwen checkpoint into a near-linear-complexity model for long sequences, but it does so by combining linear/hybrid-linear attention *with* adaptive spiking neurons, and it leans on the fact that most of the performance can be recovered with under 2% retraining data (Can spiking neurons make transformers efficient on any hardware?). The efficiency story there is as much about cheap conversion and spiking sparsity as about the attention kernel. The honest counterweight is that pure linear-state models pay for their thrift: transformers provably beat state-space (fixed-size-state) models at copying, and empirically outperform them at retrieving from context, because a compressed running state cannot hold an arbitrarily long string the way full attention can (Can state-space models match transformers at copying and retrieval?). So linearity buys speed by giving up the very capability that made attention expensive. It is a trade, not a free lunch.
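To make the fixed-size running state concrete, here is a minimal sketch in pure Python that computes causal linear attention two ways: once as a recurrence over a constant-size state, and once by revisiting every earlier token. The feature map `elu(x) + 1` and all function names are assumptions chosen for illustration; nothing here is taken from SpikingBrain or any other system in the corpus.

```python
import math

def feature_map(x):
    # Assumed positive feature map, elu(x) + 1, applied elementwise.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention_recurrent(queries, keys, values):
    """Causal linear attention using a fixed-size running state (linear in length)."""
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) * v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = feature_map(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = feature_map(q)
        norm = dot(fq, z)
        outputs.append(
            [sum(fq[i] * S[i][j] for i in range(d_k)) / norm for j in range(d_v)]
        )
    return outputs

def linear_attention_quadratic(queries, keys, values):
    """Same kernel, but each step rescans all earlier tokens (quadratic in length)."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        fq = feature_map(q)
        weights = [dot(fq, feature_map(keys[s])) for s in range(t + 1)]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(d_v)]
        )
    return outputs
```

Both functions return the same outputs up to floating-point rounding. The difference is what they keep: the recurrent form holds only `S` and `z`, whose size (`d_k * d_v + d_k` numbers) does not grow with the sequence, which is exactly why such a state cannot store an arbitrarily long string verbatim.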

That's why the more interesting architectures don't go fully linear; they split the job. Titans keeps quadratic attention for short-term, exact recall and bolts on a separate neural memory module that compresses and stores only "surprising" tokens, reaching 2M+ context without the quadratic penalty (Can neural memory modules scale language models beyond attention limits?). TransformerFAM gets unbounded-length processing by adding a feedback loop that lets the model attend to its own latents — and crucially adds *no extra weights* (Can models learn working memory by attending to their own latents?). In both, the efficiency comes from *what you choose to remember*, not from rewriting attention as a linear operation.
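The idea of writing only "surprising" tokens into a compressed memory can be illustrated with a toy. The sketch below is a deliberately simplified stand-in, not the Titans update rule: it assumes a linear associative memory, measures surprise as the norm of the prediction error, and gates writes with a hard threshold. The class name, the threshold, and the learning rate are all hypothetical.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class SurpriseGatedMemory:
    """Toy fixed-size memory that only writes when its prediction is badly off."""

    def __init__(self, dim, threshold=0.5, lr=1.0):
        self.M = [[0.0] * dim for _ in range(dim)]  # size is independent of input length
        self.threshold = threshold  # hypothetical gate, chosen for illustration
        self.lr = lr
        self.writes = 0

    def read(self, key):
        return matvec(self.M, key)

    def observe(self, key, value):
        # Surprise here is the size of the prediction error for this (key, value) pair.
        error = [v - p for v, p in zip(value, self.read(key))]
        surprise = math.sqrt(sum(e * e for e in error))
        if surprise > self.threshold:
            # Delta-rule write: one gradient step on 0.5 * ||M key - value||^2.
            for i, e in enumerate(error):
                for j, kj in enumerate(key):
                    self.M[i][j] += self.lr * e * kj
            self.writes += 1
        return surprise
```

Observing the same pair twice triggers a write only the first time, because the second prediction is already correct and therefore unsurprising. In a split architecture of the kind described above, a short attention window (not shown) would handle exact recent recall while a memory like this absorbs only what the model could not already predict.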

And some of the largest practical gains in the corpus sit entirely outside attention. MobileLLM finds that on memory-bound hardware the bottleneck is moving weights, not computing with them, so sharing weights between adjacent blocks and running the shared block twice beats fetching a separate set of weights: a latency win with zero change to attention's form (Does recomputing weights cost less than moving them on mobile?). The Hierarchical Reasoning Model solves problems that fixed-depth transformers cannot, with only 27M parameters, by using two recurrent timescales to gain effective depth (Can recurrent hierarchies achieve reasoning that transformers cannot?). That is efficiency as architecture, not as attention arithmetic. Even within attention, there's a wrinkle that complicates any clean "linear vs. softmax" framing: a handful of input-agnostic "massive activations" quietly act as implicit attention bias terms (Do hidden massive activations act as attention bias terms?), and softmax attention carries structural baggage of its own, over-weighting repeated and prominent tokens regardless of relevance (Does transformer attention architecture inherently favor repeated content?).
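The weight-sharing point can be sketched with a toy model in which a "block" is just two numbers and a "weight fetch" is a counter. Both are assumptions made for illustration: real latency depends on cache sizes and memory bandwidth, and the function names below are hypothetical rather than MobileLLM's code.

```python
def make_block(scale, shift):
    # Hypothetical stand-in for a transformer block; its "weights" are two floats.
    return {"scale": scale, "shift": shift}

def apply_block(block, x):
    return [block["scale"] * v + block["shift"] for v in x]

def run_unshared(blocks, x):
    """Every layer is treated as having its own weights: one fetch per layer."""
    fetches = 0
    for block in blocks:
        fetches += 1  # weights moved from slow memory
        x = apply_block(block, x)
    return x, fetches

def run_shared(blocks, x, repeats=2):
    """Each block is fetched once, then applied `repeats` times back to back."""
    fetches = 0
    for block in blocks:
        fetches += 1  # weights moved once, then reused while still resident
        for _ in range(repeats):
            x = apply_block(block, x)
    return x, fetches

if __name__ == "__main__":
    b1, b2 = make_block(0.5, 1.0), make_block(2.0, -1.0)
    x = [1.0, 2.0]
    deep, deep_fetches = run_unshared([b1, b1, b2, b2], x)
    shared, shared_fetches = run_shared([b1, b2], x)
    print(deep == shared, deep_fetches, shared_fetches)
```

In this toy, both runs apply four layers and produce identical outputs, but the shared version moves weights half as often. The extra compute from running a block twice is the price paid for less data movement, which is the favorable side of the trade on memory-bound hardware.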

So no — linearity alone doesn't explain the gains, and reading the corpus laterally, the better mental model is a portfolio: linear/hybrid kernels handle long-range cheapness, selective memory modules preserve recall, hardware-aware weight schemes cut data movement, and recurrent depth substitutes compute for parameters. The thing worth taking away is that "efficient transformer" is almost never one idea — it's a negotiated settlement between speed and the exact-recall ability that full attention uniquely provides.

Sources (8 notes)

Can spiking neurons make transformers efficient on any hardware?

SpikingBrain successfully adapted Qwen2.5-7B using under 2% retraining data by combining linear/hybrid-linear attention with adaptive spiking neurons, achieving transformer-comparable performance with near-linear long-sequence complexity on non-NVIDIA hardware.

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can models learn working memory by attending to their own latents?

TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.


Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do hidden massive activations act as attention bias terms?

A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Titans: Learning to Memorize at Test Time · 4.14 match · arXiv
The Topological Trouble With Transformers · 4.08 match · arXiv
It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization · 3.16 match · arXiv
Repeat After Me: Transformers are Better than State Space Models at Copying · 2.53 match · arXiv
Differential Transformer · 2.39 match · arXiv
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters · 2.35 match · arXiv
Hierarchical Reasoning Model · 1.72 match · arXiv
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach · 1.71 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a systems researcher evaluating whether linear attention kernels are the primary efficiency lever in modern transformers, or one tactic among many. The question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. A library across this range identified:

• Linear/hybrid attention alone rarely wins: SpikingBrain pairs linear/hybrid-linear attention with adaptive spiking neurons (converted with under 2% retraining data), so its gains owe as much to spiking sparsity and cheap conversion as to the attention kernel (~2025).
• State-space models provably lose on long-context recall: transformers beat fixed-size-state architectures at copying and arbitrarily long retrieval, because full attention preserves context where linear compression cannot (~2024).
• Selective memory beats pure linearity: Titans (2024–2025) keeps quadratic attention for short-term recall and adds a separate neural memory module that stores only "surprising" tokens, reaching 2M+ context without full quadratic cost.
• Feedback loops + no new weights work: TransformerFAM lets models attend to their own latents for unbounded-length processing without parameter increase (~2024).
• Hardware data movement, not compute, is the bottleneck: MobileLLM achieves efficiency via weight sharing and recomputation across blocks (~2024), orthogonal to attention form.
• Recurrent depth substitutes for parameters: Hierarchical Reasoning Model uses only 27M parameters and two recurrent timescales to solve problems that fixed-depth transformers cannot (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.01032 (2024): Repeat After Me — transformers outperform state-space models on copying.
• arXiv:2501.00663 (2025): Titans — neural memory modules + quadratic attention for hybrid efficiency.
• arXiv:2402.14905 (2024): MobileLLM — weight sharing and hardware-aware latency.
• arXiv:2509.05276 (2025): SpikingBrain — spiking neurons + linear attention in checkpoint conversion.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether models released in the last 6 months (e.g., Llama 3.2, Claude 3.5, GPT-4o variants), new training methods (e.g., continued pretraining, scaling laws for hybrid architectures), new evaluation harnesses (long-context evals, retrieval recall on 10M+ token windows), or orchestration (multi-agent memory pooling, hierarchical caching), have since relaxed or overturned it. Separate durable from perishable: Is "linear attention alone insufficient" still true? Has any pure-linear system now matched quadratic attention on long-context recall without auxiliary modules? Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper claiming pure linearity *does* suffice, or new evidence that the portfolio view is outdated.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do emergent scaling laws for hybrid architectures now favor one hybrid ratio over others?" or "Can in-context learning substitute for the selective-memory module in Titans?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
