SpikingBrain: Spiking Brain-inspired Large Models

Paper · arXiv 2509.05276 · Published September 5, 2025
Novel LLM Architectures

Mainstream Transformer-based large language models (LLMs) face significant efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly. These constraints limit their ability to process long sequences effectively. In addition, building large models on non-NVIDIA computing platforms poses major challenges in achieving stable and efficient training and deployment. To address these issues, we introduce SpikingBrain, a new family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three core aspects: i) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; ii) Algorithmic Optimizations: an efficient, conversion-based training pipeline compatible with existing LLMs, along with a dedicated spike coding framework; iii) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to the MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms.
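To make the quadratic-versus-linear contrast concrete, the following is a minimal sketch in plain Python, not the SpikingBrain architecture itself. It assumes causal attention without softmax and with an identity feature map, a common simplification when explaining linear attention. The function names `quadratic_attention` and `linear_attention` are illustrative only. The first function visits every (query, key) pair, so its cost grows quadratically with sequence length; the second keeps a fixed-size running state and produces the same outputs in a single pass.

```python
def quadratic_attention(q, k, v):
    """Causal attention without softmax, computed pair by pair.

    Work grows as O(n^2) in the sequence length n, because every
    position t scores itself against all positions s <= t.
    """
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        acc = [0.0] * d_v
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                acc[j] += score * v[s][j]
        out.append(acc)
    return out


def linear_attention(q, k, v):
    """Same outputs via a running state S = sum_s k_s v_s^T.

    Work grows as O(n), and the state has a fixed size d_k x d_v,
    so memory at inference time does not grow with the sequence.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        # Fold the current key/value pair into the state.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        # Read out with the current query: o_t = q_t^T S.
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out
```

Both functions return the same values up to floating-point rounding. The difference lies in what must be kept around: the pairwise form needs all past keys and values, while the recurrent form needs only the d_k x d_v state.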

Introduction. Recent advances in large language models (LLMs) built on the Transformer architecture [1] have been driven by the scaling law [2], which suggests that performance improves with larger model sizes and more data [3, 4, 5, 6]. However, this scale-driven approach comes with significant challenges: extremely high training costs, high energy consumption, and complex deployment pipelines. Therefore, achieving high performance and energy efficiency under limited resources has become a critical research goal. To address this, our work draws inspiration from brain mechanisms. We explore novel architectures, training paradigms, and spike coding schemes to develop efficient, brain-inspired LLMs that move beyond the traditional Transformer framework. A key focus of this study is to validate the training and deployment of such models on non-NVIDIA computing clusters. We use an open-source Transformer checkpoint (Qwen2.5-7B-base [7] as an example) together with our efficient development framework to train and evaluate two models on the MetaX GPU cluster.

Discussion / Conclusion. This work provides a comprehensive demonstration of efficient brain-inspired large model training on the MetaX GPU cluster. By integrating several key techniques (novel (hybrid) linear architectures beyond Transformers, sparse MoE design, lightweight conversion training, and adaptive-threshold sparse spiking activation), we validate and implement a practical development pipeline for spiking-based large models on a non-NVIDIA cluster with hundreds of GPUs. We release two models as outcomes of this effort: the linear model SpikingBrain-7B and the MoE hybrid-linear model SpikingBrain-76B-A12B. These models offer two main advantages: i) Training efficiency: With linear or near-linear complexity, they substantially accelerate long-sequence training and match the performance of many open-source Transformer models while using less than 2% of the training data.
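The text above names adaptive-threshold sparse spiking activation without giving its details. As a rough illustration of the general idea only, the sketch below assumes one simple scheme: the firing threshold is set from the mean absolute activation divided by a hypothetical scale factor `k`, and each activation is rounded to a signed integer spike count. The function names, the parameter `k`, and the rounding rule are assumptions made for this example and are not confirmed by the text.

```python
def adaptive_threshold_spikes(activations, k=1.0):
    """Convert real-valued activations to signed integer spike counts.

    Assumed scheme (illustrative only): the threshold adapts to the
    input as mean(|a|) / k, so small activations round to zero spikes
    and large ones emit several spikes.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    mean_abs = sum(abs(a) for a in activations) / len(activations)
    # Fall back to 1.0 when every activation is zero, to avoid dividing by zero.
    threshold = mean_abs / k if mean_abs > 0 else 1.0
    counts = [int(round(a / threshold)) for a in activations]
    return counts, threshold


def sparsity(counts):
    """Fraction of units that emit no spike at all."""
    return sum(1 for c in counts if c == 0) / len(counts)


counts, threshold = adaptive_threshold_spikes(
    [0.0, 0.05, -0.02, 1.0, 0.0, -0.9]
)
print(counts, round(threshold, 4), round(sparsity(counts), 3))
```

Under this assumed scheme, multiplying each count by the threshold gives a coarse reconstruction of the original activation, and units with zero counts need no computation downstream. A larger `k` lowers the threshold, which gives finer resolution at the cost of more spikes.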