Can sparse attention match dense models without retrofitting?
Does training sparse-attention mechanisms jointly during pretraining, rather than retrofitting them onto finished models, allow them to match full-attention performance at frontier scale? This matters because it challenges the assumption that sparsity is inherently a quality trade-off.
Most sparse-attention proposals are evaluated as efficiency patches: take a trained dense model, approximate its attention, accept some quality loss. MiniMax Sparse Attention (MSA) is interesting precisely because it refuses that framing. The block selector is a lightweight Index Branch that scores key-value blocks per GQA group and keeps a Top-k subset. It is trained during native multimodal pretraining of a 109B-MoE model and aligned to the full-attention branch by a KL loss, with a stop-gradient that confines the auxiliary objective to the index projections. The payoff: at 1M context it cuts per-token attention compute by 28.4x while staying on par with a full-attention GQA baseline, rather than below it.
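The selection-plus-alignment idea can be made concrete with a toy sketch. This is not MSA's implementation: it handles one query token and one GQA group, and the function names, the toy inputs, the use of summed full-attention probability per block as the KL target, and the direction of the KL term are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_mass(attn_probs, block_size):
    """Sum full-attention probabilities inside each block (assumed target)."""
    return [sum(attn_probs[i:i + block_size])
            for i in range(0, len(attn_probs), block_size)]

def select_top_k(scores, k):
    """Indices of the k highest-scoring blocks, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def kl_alignment_loss(target, index_scores):
    """KL(target || softmax(index_scores)) over blocks.

    In an autodiff framework the target would sit behind a stop-gradient,
    so only the index projections receive this loss. Plain Python has no
    gradients, so the target is simply treated as a constant here.
    """
    pred = softmax(index_scores)
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

# Hypothetical toy inputs: one query token, one GQA group, 4 blocks of 2 keys.
index_query = [1.0, 0.0]
index_block_keys = [[0.1, 0.3], [2.0, -1.0], [0.5, 0.5], [1.5, 0.2]]
full_attn_probs = [0.02, 0.03, 0.35, 0.25, 0.03, 0.02, 0.20, 0.10]

scores = [dot(index_query, k) for k in index_block_keys]  # Index Branch scores
kept_blocks = select_top_k(scores, k=2)                   # blocks the token attends to
target = block_mass(full_attn_probs, block_size=2)        # full-attention view of blocks
loss = kl_alignment_loss(target, scores)                  # auxiliary alignment signal
```

The sketch keeps the two roles separate: `select_top_k` decides which blocks the sparse path attends to, while `kl_alignment_loss` only teaches the scorer to rank blocks the way full attention would.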
This sharpens a tension the vault already holds. The note "Does sparse attention trade off quality for speed?" argues that sparsity is not a quality-cost trade but a frontier move. MSA supplies the production-scale existence proof, but only because selection is learned end-to-end, which is the part the Sparse Frontier benchmark abstracts away. It also operationalizes "Does fixed sparsity work for all sequence lengths?": MSA's per-group Top-k is a budget, and the honest open question is whether a static k repeats the fixed-budget mistake at short sequence lengths.
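The budget question can be stated as arithmetic: with a static k, the fraction of blocks a token attends to depends on sequence length. The sketch below uses a hypothetical block size of 64 and k of 16. These are not MSA's settings, which this note does not state, and the calculation ignores the cost of the index branch itself.

```python
import math

def kept_fraction(seq_len, block_size, top_k):
    """Fraction of key-value blocks attended to under a static Top-k budget."""
    num_blocks = math.ceil(seq_len / block_size)
    return min(top_k, num_blocks) / num_blocks

# Hypothetical settings, chosen only to show the trend across lengths.
BLOCK_SIZE = 64
TOP_K = 16

for seq_len in (1_024, 4_096, 1_048_576):
    frac = kept_fraction(seq_len, BLOCK_SIZE, TOP_K)
    print(f"{seq_len:>9} tokens: {frac:.6f} of blocks kept")
```

Under these assumed numbers the budget covers every block at the shortest length and a tiny fraction at 1M tokens. Whether that length-dependent schedule is the right one is exactly the open question above.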
The counterargument worth keeping is the hard limit from "Can state-space models match transformers at copying and retrieval?": any scheme that discards key-value blocks risks the same retrieval failures as a compressed state. MSA dodges this by keeping full KV and selecting blocks per token rather than compressing, a meaningfully different bet from linear attention. The deeper point for writing: the credible efficiency story is no longer "approximate a dense model" but "co-design the sparsity with the GPU execution path and the pretraining objective from the start."
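A toy contrast shows why selecting is a different bet from compressing. It is a deliberately crude sketch: mean pooling stands in for a compressed state and is not how state-space models or linear attention actually summarize context, and the one-hot keys, the values, and the `scale` factor are hypothetical choices that make the attention sharp.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, scale=20.0):
    """Softmax attention over one block whose keys and values are kept in full."""
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(logits)
    return sum(w * v for w, v in zip(weights, values))

def mean_pool(values):
    """Lossy stand-in for a compressed state: one average per block."""
    return sum(values) / len(values)

# Hypothetical block: four one-hot keys, each tagging a distinct scalar value.
keys = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
values = [10.0, 20.0, 30.0, 40.0]
query = [0, 0, 1, 0]  # asks for the third entry

selected = attend(query, keys, values)  # block kept in full, then attended
compressed = mean_pool(values)          # block replaced by its average
```

Once the right block is selected, the stored value is recoverable almost exactly. The averaged summary cannot return it however the query is phrased. The remaining risk for a selection scheme is choosing the wrong block, which is what the learned scorer is for.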
Related concepts in this collection 4
- Does sparse attention trade off quality for speed?
  When sparse attention is compared fairly, with larger sparse models set against smaller dense ones at the same compute cost, does it still represent a quality-cost trade-off, or does it actually improve performance?
  exemplifies (production-scale proof of the Pareto claim)
- Does fixed sparsity work for all sequence lengths?
  Production systems often apply the same sparsity budget regardless of input length. Does this one-size-fits-all approach actually work across short and long contexts, or does optimal sparsity vary with sequence length?
  extends (per-group Top-k inherits the fixed-budget risk)
- Can state-space models match transformers at copying and retrieval?
  Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.
  grounds (why MSA keeps full KV rather than compressing)
- Can spiking neurons make transformers efficient on any hardware?
  Explores whether brain-inspired spiking mechanisms combined with linear attention can adapt existing transformer checkpoints into efficient models trainable outside NVIDIA ecosystems using minimal additional data.
  convergent-with (rival route to long-context efficiency: retrofit vs native training)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- MiniMax Sparse Attention
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Beyond Language Modeling: An Exploration of Multimodal Pretraining
- Jamba: A Hybrid Transformer-Mamba Language Model
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Original note title
sparse attention earns its place by surviving native pretraining at frontier scale not by being bolted onto a finished model