MiniMax Sparse Attention

Paper · arXiv 2606.13392 · Published June 11, 2026
Novel LLM Architectures

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key–value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4× at 1M context.
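To make the two-branch structure concrete, the following is a minimal NumPy sketch of one decode step: an index branch scores key–value blocks, each GQA group keeps its own Top-k blocks, and the main branch runs exact softmax attention over only those blocks. It is an illustration under stated assumptions, not the paper's implementation: the function name `msa_sketch`, the separate index query/key tensors `iq` and `ik`, and the mean-pooling of index keys within a block are all hypothetical choices, and causal masking and the GPU kernel design are omitted.

```python
import numpy as np

def msa_sketch(q, k, v, iq, ik, block_size, top_k):
    """Block-sparse attention for a single query token (illustrative only).

    q:  (G, Hg, D)  query heads, arranged by GQA group
    k:  (G, N, D)   one key head per GQA group
    v:  (G, N, D)   one value head per GQA group
    iq: (G, Di)     index-branch query per group (hypothetical layout)
    ik: (G, N, Di)  index-branch keys per group (hypothetical layout)
    """
    G, Hg, D = q.shape
    N = k.shape[1]
    nb = N // block_size  # assumes N is a multiple of block_size

    # Index Branch: one score per (group, block).
    # Assumption: block keys are formed by mean-pooling index keys.
    ik_blocks = ik.reshape(G, nb, block_size, -1).mean(axis=2)  # (G, nb, Di)
    scores = np.einsum("gd,gbd->gb", iq, ik_blocks)             # (G, nb)

    # Top-k on raw scores. Softmax is monotonic, so ranking raw scores
    # yields the same selection as ranking probabilities; this is one
    # plausible reading of "exp-free" selection.
    sel = np.argsort(-scores, axis=-1)[:, :top_k]               # (G, top_k)

    # Main Branch: exact softmax attention restricted to selected blocks.
    out = np.empty_like(q)
    for g in range(G):
        tok = (sel[g][:, None] * block_size + np.arange(block_size)).reshape(-1)
        ks, vs = k[g, tok], v[g, tok]                           # (top_k*B, D)
        logits = q[g] @ ks.T / np.sqrt(D)                       # (Hg, top_k*B)
        logits = logits - logits.max(axis=-1, keepdims=True)
        w = np.exp(logits)
        w = w / w.sum(axis=-1, keepdims=True)
        out[g] = w @ vs
    return out, sel
```

All heads within a group share the same selected blocks, so the gathered keys and values are reused across the group's query heads. When `top_k` equals the number of blocks, the sketch reduces to dense GQA attention for that token.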

Introduction. Large language models (LLMs) are rapidly shifting from short, single-turn interactions to long-horizon agentic workflows that span hundreds of interleaved reasoning and action steps: writing and deploying production code, navigating the open web, orchestrating diverse tools, and producing structured documents (Anthropic, 2025; DeepSeek-AI, 2026; Google DeepMind, 2025; Moonshot AI, 2026; OpenAI, 2025; Zhipu AI, 2026). However, the ultra-long contexts these tasks demand impose severe compute and memory bottlenecks on both training and inference. Quadratic-cost softmax attention is the primary culprit, and the latency and throughput constraints of production-scale deployment amplify it further. Context length is a critical scaling dimension for LLMs, and trading off model quality against efficiency along it remains a formidable challenge. The community is actively pushing the Pareto frontier on this front.

Discussion / Conclusion. We introduced MSA, a sparse-attention mechanism co-designed with Grouped-Query Attention. The architecture attaches a lightweight Index Branch to a standard GQA layer: each GQA group independently selects a small set of key-value blocks through a block-level dot-product indexer, and the Main Branch performs softmax attention restricted to the selected blocks. The Index Branch is a pure selector. It is trained with a KL alignment loss against the Main Branch under a two-stage warmup schedule, with a stop-gradient on the index input that confines the auxiliary loss to the index projections. At the 109B-MoE scale, MSA preserves the capability of a GQA Full-Attention baseline across most pretraining and agentic benchmarks while reducing per-token attention compute by 28.4× at 1M context, the regime in which long-context inference becomes the binding deployment constraint.
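The KL alignment objective can be sketched as follows. The text above does not spell out the exact form of the loss, so this is one plausible formulation under stated assumptions rather than the authors' definition: the target is taken to be the Main Branch attention mass summed within each block and averaged over the heads of a GQA group, and the Index Branch distribution is a softmax over its block scores. The function name `kl_alignment_loss` and its arguments are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_alignment_loss(main_logits, index_scores, block_size, eps=1e-9):
    """KL(target || index) per GQA group, averaged over groups (illustrative).

    main_logits:  (G, Hg, N)  Main Branch attention logits for one query token
    index_scores: (G, nb)     Index Branch block scores, nb = N // block_size
    """
    G, Hg, N = main_logits.shape
    nb = N // block_size  # assumes N is a multiple of block_size

    # Target: Main Branch attention mass per block, averaged over the
    # group's heads (assumption). In an autodiff framework this target
    # would be treated as a constant, and the hidden states fed to the
    # Index Branch would be detached so that only the index projections
    # receive gradient from this loss.
    p_tok = softmax(main_logits)                                   # (G, Hg, N)
    p_blk = p_tok.reshape(G, Hg, nb, block_size).sum(-1).mean(1)   # (G, nb)

    # Index Branch distribution over blocks.
    q_blk = softmax(index_scores)                                  # (G, nb)

    kl = np.sum(p_blk * (np.log(p_blk + eps) - np.log(q_blk + eps)), axis=-1)
    return float(kl.mean())
```

Under this formulation the loss is zero when the index scores reproduce the Main Branch's block-level attention distribution, which is the sense in which the Index Branch acts as a pure selector aligned to the Main Branch.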