Large Language Diffusion Models
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised finetuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs. Project page and codes: https: //ml-gsai.github.io/LLaDA-demo/.
Introduction. Large language models (LLMs) (Zhao et al., 2023) fall entirely within the framework of generative modeling. Specifically, LLMs aim to capture the true but unknown language We argue that the answer is not a simple “yes”. The key insight overlooked previously is: it is the generative modeling principles (i.e., Eq. (1)), rather than the autoregressive formulation (i.e., Eq. (2)) itself, that fundamentally underpin the essential properties of LLMs, as detailed below. However, certain inherent limitations of LLMs can be directly traced to their autoregressive nature. Nevertheless, the autoregressive nature of LLMs presents notable challenges. For example, sequential token-by-token generation incurs high computational costs, and the leftto-right modeling limits effectiveness in reversal reasoning tasks (Berglund et al., 2023). These inherent limitations constrain LLMs in handling longer and more complex tasks.
Discussion / Conclusion. We introduce LLaDA, a principled and previously unexplored approach to large language modeling based on diffusion models. LLaDA demonstrates strong capabilities in scalability, in-context learning, and instruction-following, achieving performance comparable to strong LLMs. In addition, LLaDA offers unique advantages such as bidirectional modeling and enhanced robustness, effectively addressing several inherent limitations of existing LLMs. Our findings not only establish diffusion models as a viable and promising alternative but also challenge the prevailing assumption that these essential capabilities are inherently tied to ARMs. Due to computational constraints, direct comparisons between LLaDA and ARMs—such as training on identical datasets—were restricted to a computational budget of less than 1023 FLOPs. To allocate resources for training the largest possible LLaDA model and showcasing its potential, we were unable to scale the ARM baseline to the same extent. Moreover, no specialized attention mechanisms or position embeddings were designed for LLaDA, nor were any system-level architectural optimizations applied.