Looped Diffusion Language Models
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models for language modeling, yet the effective design of transformer architectures for MDMs remains underexplored. In this paper, we show that selectively looping the early-middle transformer layers significantly improves both training efficiency and model performance in MDMs. We call this approach LoopMDM (Looped Masked Diffusion Model). It brings two key benefits: looping layers at training time yields a depth-scaling effect without adding parameters, while varying the number of loops at inference time enables flexible compute scaling. Despite its simplicity, the results are striking: across multiple pre-training corpora, LoopMDM matches the performance of same-size MDMs with up to 3.3× fewer training FLOPs, and its final performance exceeds theirs on various reasoning benchmarks, by up to 8.5 points on GSM8K. It even surpasses deeper non-looped MDMs trained with comparable per-step compute, indicating that selective looping is more effective than naive depth scaling. Furthermore, LoopMDM can scale inference-time compute by increasing the number of loops.
Introduction. Masked diffusion models (MDMs) [2, 38, 59, 65] have emerged as a competitive alternative to autoregressive models (ARMs) for language modeling. Recent advances in training objectives [53], noise schedules [27], and architectures [7, 17] have steadily narrowed the gap with ARMs, establishing MDMs as an increasingly viable framework for text generation. Building on these improvements, recent work has scaled MDMs to larger models, including the LLaDA family [47, 80, 4], Dream [76], and Seed Diffusion [66]. Although these approaches demonstrate the promise of scaling, they also raise the question of how to improve performance along axes other than parameters and training tokens. A natural candidate from the AR literature is the looped transformer [15, 20, 19, 81]. Looped architectures apply a shared block repeatedly, so that effective depth comes from the number of loops while the parameter cost stays fixed. Looping has been studied in ARMs as a form of test-time compute scaling, but its role in MDMs remains unexplored.
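To make the looping idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a stack of layers split into a prefix applied once, a contiguous shared block applied `n_loops` times, and a suffix applied once. The names `looped_forward`, `effective_depth`, `loop_start`, `loop_end`, and the toy `make_layer` stand-in for a transformer layer are all hypothetical and chosen only for illustration; the source does not specify which layer indices are looped or how the loop is wired.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a transformer layer (assumption): a residual update."""
    def layer(h: Vector) -> Vector:
        return [x + scale * x + shift for x in h]
    return layer


def looped_forward(
    h: Vector,
    layers: Sequence[Layer],
    loop_start: int,
    loop_end: int,
    n_loops: int,
) -> Vector:
    """Apply layers[:loop_start] once, layers[loop_start:loop_end] n_loops
    times with the same (shared) weights, then layers[loop_end:] once."""
    if not 0 <= loop_start < loop_end <= len(layers):
        raise ValueError("loop range must be a non-empty slice of the stack")
    if n_loops < 1:
        raise ValueError("n_loops must be at least 1")
    for layer in layers[:loop_start]:
        h = layer(h)
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            h = layer(h)
    for layer in layers[loop_end:]:
        h = layer(h)
    return h


def effective_depth(n_layers: int, loop_start: int, loop_end: int, n_loops: int) -> int:
    """Number of layer applications per forward pass; the number of distinct
    parameterized layers stays at n_layers regardless of n_loops."""
    return n_layers + (n_loops - 1) * (loop_end - loop_start)


if __name__ == "__main__":
    stack = [make_layer(0.1, 0.0) for _ in range(6)]
    out = looped_forward([1.0, 2.0], stack, loop_start=1, loop_end=3, n_loops=3)
    print(len(out), effective_depth(len(stack), 1, 3, 3))
```

In this sketch, setting `n_loops=1` recovers the ordinary non-looped forward pass, and changing `n_loops` at call time alters the compute per forward pass without touching the parameter count, which is the property the looped-transformer literature relies on.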
Discussion / Conclusion. We introduced LoopMDM, which selectively loops a small shared block inside masked diffusion language models. LoopMDM improves matched-compute performance across language modeling and reasoning tasks without increasing parameter count, and achieves comparable or better performance with fewer training FLOPs than non-looped baselines. These results suggest that the gains from looping are not solely explained by increased depth, but are consistent with benefits from reusing computation within the diffusion architecture. More broadly, our findings indicate that selective looping can serve as a simple approach for improving masked diffusion models under fixed parameter and compute budgets. We believe our work opens up broad opportunities for architectural advances that improve training efficiency and generalization. We discuss limitations and future work in Appendix E.