INQUIRING LINE

Can retrofitted sparse attention ever match jointly-trained sparse attention?

This explores whether sparse attention bolted onto an already-trained dense model can rival sparse attention that was learned from scratch during pretraining — and what the corpus says about why that gap exists.


This question is really asking whether sparsity is a property you can patch in after the fact, or one that has to be grown into the model from the beginning. The strongest signal in the corpus leans hard toward the latter. MiniMax's result is the cleanest example: its block selector (the part that decides which chunks of context to attend to) is trained end-to-end during native pretraining, and the note credits that joint training for a 28.4× per-token attention compute reduction at a 1M-token context while still matching full-attention GQA performance at 109B parameters Can sparse attention match dense models without retrofitting?. The framing there is pointed: joint training makes sparsity a 'frontier move' rather than an 'efficiency patch.' That word *patch* is the retrofit case, and the implication is that retrofitting buys you speed but not parity.
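To make "block selector" concrete, here is a minimal sketch of block-sparse attention for a single query. It is not MiniMax's selector: the scoring rule (dot product with each block's mean key), the function names, and the `block_size` and `top_k` parameters are all assumptions chosen for illustration. The selector here is a fixed heuristic, which is roughly the retrofit situation; in the jointly-trained case the selector would have learned parameters updated alongside the rest of the model.

```python
import numpy as np

def dense_attention(q, K, V):
    """Standard softmax attention from one query over every key."""
    logits = (K @ q) / np.sqrt(K.shape[1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V

def block_sparse_attention(q, K, V, block_size, top_k):
    """Attend from one query to only the top_k highest-scoring key blocks.

    Hypothetical selector: each block is scored by the query's dot product
    with the block's mean key. A trained selector would learn this scoring.
    """
    n, d = K.shape
    assert n % block_size == 0, "sketch assumes the context splits evenly into blocks"
    n_blocks = n // block_size
    top_k = min(top_k, n_blocks)

    # One summary vector per block, then one score per block.
    block_keys = K.reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = block_keys @ q
    chosen = np.sort(np.argsort(block_scores)[-top_k:])

    # Expand the chosen block ids into token indices.
    idx = (chosen[:, None] * block_size + np.arange(block_size)[None, :]).ravel()

    # Ordinary softmax attention, restricted to the selected tokens.
    logits = (K[idx] @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx], chosen
```

In this sketch the query touches `top_k * block_size` keys instead of all `n`, so the per-token attention cost falls by roughly `n_blocks / top_k`, ignoring the cost of scoring the blocks. Setting `top_k` to the number of blocks recovers dense attention exactly, which makes the quality question visible: everything depends on whether the selector keeps the blocks the dense model would have weighted most.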

Why would the trained-from-scratch version pull ahead? Because what gets attended to isn't a fixed routing rule; it's something the model learns alongside everything else. The corpus shows sparsity in LLMs is fundamentally *learned*, not bolted on: networks develop dense activations for familiar data and fall back to sparse representations for unfamiliar inputs, and this emerges through pretraining exposure without any task-specific tuning Is representational sparsity learned or intrinsic to neural networks?. Relatedly, models sparsify their hidden states adaptively as tasks get harder or more out-of-distribution, using that sparsity as a selective filter rather than suffering it as a failure Do language models sparsify their activations under difficult tasks?. Both of these notes concern sparsity in activations rather than in attention, so they support the argument by analogy rather than directly. The thread connecting them: a model trained with sparsity has co-adapted its representations to it. A retrofit asks a dense-trained network to suddenly route through pathways it never optimized for, which means imposing a structure the weights weren't shaped around.

The more interesting reframe is whether "match" is even the right goal. The Sparse Frontier work shows that at equal compute, larger sparse models beat smaller dense ones on long-context tasks — sparsity isn't trading quality for speed, it's expanding the cost-performance frontier and letting you train bigger models in the same budget Does sparse attention trade off quality for speed?. That Pareto-improving benefit only shows up when sparsity is part of the training budget decision from the start. A retrofit, by definition, has already spent its budget on a dense model; it can recover efficiency but it can't retroactively claim the larger-model-for-the-same-cost advantage that makes native sparse training a win rather than a compromise.

Where a retrofit might genuinely close the gap is when sparsity is moved out of the attention mechanism and into a separate, trainable module. Titans, for instance, doesn't try to sparsify attention itself — it keeps attention as quadratic short-term memory and adds a neural long-term memory that learns which surprising tokens to keep, scaling past 2M tokens Can neural memory modules scale language models beyond attention limits?. That's an architectural addition you can in principle train onto an existing backbone, sidestepping the co-adaptation problem because the new module learns its own routing rather than rewiring the old attention.
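The idea of a separate memory that "learns which surprising tokens to keep" can be sketched as follows. This is a simplified illustration and not the Titans implementation: the memory is a single linear map rather than a deep network, the learning rate, momentum, and forgetting rate are fixed scalars rather than learned data-dependent gates, and the class and parameter names are hypothetical. Surprise is taken to be the gradient of the recall error, so tokens the memory already predicts well barely change it.

```python
import numpy as np

class SurpriseMemory:
    """Toy linear associative memory updated in proportion to surprise.

    Assumptions for illustration only: linear memory M, squared recall
    error as the loss, and constant lr / momentum / forget values.
    """

    def __init__(self, dim, lr=0.1, momentum=0.9, forget=0.01):
        self.M = np.zeros((dim, dim))  # long-term memory weights
        self.S = np.zeros((dim, dim))  # running (momentum) surprise
        self.lr, self.momentum, self.forget = lr, momentum, forget

    def recall(self, k):
        """Read the value the memory currently associates with key k."""
        return self.M @ k

    def write(self, k, v):
        """Store (k, v); returns the surprise magnitude for this token."""
        err = self.recall(k) - v
        grad = 2.0 * np.outer(err, k)  # gradient of ||M k - v||^2 w.r.t. M
        # Past surprise carries over, new surprise is added on top.
        self.S = self.momentum * self.S - self.lr * grad
        # Slight decay of old content, then apply the surprise update.
        self.M = (1.0 - self.forget) * self.M + self.S
        return float(np.linalg.norm(grad))
```

Writing the same key-value pair repeatedly drives the returned surprise toward a small value as the memory absorbs the association. The relevance to retrofitting is that `M` and `S` are new state trained on their own objective, so nothing in the frozen attention weights has to be rerouted.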

The honest synthesis: the corpus has no head-to-head benchmark of a retrofitted sparse selector against a jointly-trained one, so a strict numerical answer isn't here. But the converging evidence (that effective sparsity is learned through training exposure, that it co-adapts with representations, and that its biggest payoff is a budget decision made before pretraining) suggests a retrofit can approach native sparse attention on raw efficiency but is structurally disadvantaged on quality, unless it routes the sparsity through a fresh trainable module rather than reshaping attention the model has already locked in.


Sources (5 notes)

Can sparse attention match dense models without retrofitting?

MiniMax Sparse Attention achieves 28.4× per-token attention compute reduction at 1M context while matching full-attention GQA performance at 109B, because its block selector is trained end-to-end during native pretraining rather than retrofitted. This shows sparsity can be a frontier move, not just an efficiency patch.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
