Does looping layers beat adding depth in diffusion models?
When scaling masked diffusion language models at a fixed parameter count, is reusing computation through selective layer looping more efficient than simply making the network deeper? This matters because it challenges the conventional assumption that improvement has to come from more parameters, more layers, or more training tokens.
Masked diffusion models (MDMs) have become a competitive alternative to autoregressive (AR) models for language, but improvements have mostly come from scaling parameters and training tokens. LoopMDM asks how to improve along a different axis by importing the looped transformer from the AR literature: apply a shared block repeatedly, converting depth into loops at fixed parameter cost.
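The trade between depth and loops can be made concrete with a little bookkeeping. The sketch below is illustrative only: `StackConfig`, the layer counts (8 and 24), the loop count (3), and the unit parameter cost are hypothetical choices, not values from LoopMDM. It treats effective depth (layer applications per forward pass) as a rough proxy for per-step compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackConfig:
    """Hypothetical description of a transformer stack (not from the paper)."""

    unique_layers: int         # layers that own their own weights
    loops: int = 1             # times the shared block is applied per forward pass
    params_per_layer: int = 1  # arbitrary unit; real layers have millions of weights

    @property
    def parameters(self) -> int:
        # Weight sharing means looping adds no parameters.
        return self.unique_layers * self.params_per_layer

    @property
    def effective_depth(self) -> int:
        # Layer applications per forward pass, a rough proxy for per-step compute.
        return self.unique_layers * self.loops


# Illustrative numbers only.
baseline = StackConfig(unique_layers=8)         # plain 8-layer model
looped = StackConfig(unique_layers=8, loops=3)  # same weights, applied 3 times
deeper = StackConfig(unique_layers=24)          # non-looped model of equal depth

for name, cfg in [("baseline", baseline), ("looped", looped), ("deeper", deeper)]:
    print(f"{name:8s} parameters={cfg.parameters:2d} effective_depth={cfg.effective_depth:2d}")
```

Under these assumptions the looped stack has the baseline's parameter count and the deeper stack's effective depth, which is the sense in which looping converts depth into loops at fixed parameter cost.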
The specific finding is that selectively looping the early-middle layers, rather than the whole network, significantly improves both training efficiency and final performance. Looping at training time yields a depth-scaling effect without adding parameters, while varying the loop count at inference enables flexible compute scaling. The numbers are striking: LoopMDM matches same-size MDMs with up to 3.3× fewer training FLOPs, and its final performance exceeds theirs on reasoning benchmarks (up to +8.5 on GSM8K). Most tellingly, it surpasses deeper non-looped MDMs trained with comparable per-step compute, so the gain is not just "more depth."
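A minimal sketch of what selective looping looks like in a forward pass, assuming a toy stand-in for the real architecture. The helper names (`make_layer`, `forward`), the six-layer stack, and the choice of indices 1 to 3 as the "early-middle" block are all hypothetical; the note does not specify which layers LoopMDM loops or how many times. Each toy layer is a scalar residual update rather than a transformer block.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(weight: float) -> Layer:
    """Toy residual layer x + weight * tanh(x), standing in for a transformer block."""

    def layer(x: Vector) -> Vector:
        return [xi + weight * math.tanh(xi) for xi in x]

    return layer


def forward(
    x: Vector,
    layers: Sequence[Layer],
    loop_start: int,
    loop_end: int,
    n_loops: int,
) -> Tuple[Vector, int]:
    """Run layers[:loop_start] once, layers[loop_start:loop_end] n_loops times,
    and layers[loop_end:] once. Returns the output and the number of layer
    applications, a rough proxy for per-step compute."""
    applied = 0
    for layer in layers[:loop_start]:          # early layers, applied once
        x = layer(x)
        applied += 1
    for _ in range(n_loops):                   # shared block, weights reused
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
            applied += 1
    for layer in layers[loop_end:]:            # late layers, applied once
        x = layer(x)
        applied += 1
    return x, applied


# Six layers with distinct (hypothetical) weights; indices 1..3 form the looped block.
layers = [make_layer(0.1 * (i + 1)) for i in range(6)]
x0 = [0.5, -1.0, 2.0]

out_1, cost_1 = forward(x0, layers, loop_start=1, loop_end=4, n_loops=1)
out_3, cost_3 = forward(x0, layers, loop_start=1, loop_end=4, n_loops=3)
print(f"n_loops=1: {cost_1} layer applications")
print(f"n_loops=3: {cost_3} layer applications")
```

The same six layers serve both calls, so the parameter count is unchanged while `n_loops` raises or lowers the compute spent per forward pass. With `n_loops=1` the function reduces to an ordinary non-looped stack.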
The conceptual takeaway is that reusing computation is more effective than adding depth under fixed parameter and compute budgets, at least for diffusion LMs. That this works in MDMs (where it had not been explored) extends the looped-architecture story beyond autoregression. It pairs naturally with the note "How do looped transformer layers actually behave during inference?", which explains why reused computation helps: the loop re-applies the same stages of inference rather than computing genuinely new ones.
Related concepts in this collection 3
- How do looped transformer layers actually behave during inference?
  When language models loop their layers to improve reasoning, do they discover new computations or repeat existing ones? Understanding the internal dynamics could explain why recurrent architectures outperform simple depth scaling.
  the mechanistic explanation for why looping works
- Can looped transformers generalize to unseen knowledge combinations?
  Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.
  looping as a route to capability vanilla fixed-depth models lack
- Can recurrent hierarchies achieve reasoning that transformers cannot?
  Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
  same escape from fixed-depth constraints, different recurrence structure
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Looped Diffusion Language Models
- A Survey on Diffusion Language Models
- The Serial Scaling Hypothesis
- The Unreasonable Ineffectiveness of the Deeper Layers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Learn from your own latents and not from tokens: A sample-complexity theory
- The Vanishing Gradient Problem for Stiff Neural Differential Equations
Original note title
selective layer looping beats naive depth scaling in masked diffusion language models — reused computation outperforms added depth at fixed parameters