Diffusion-Based LLMs

Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive models enable efficient RL post-training because their sequence log-probabilities factorize token by token, but diffusion models generate tokens in parallel, in no fixed order. What makes likelihood computation intractable in diffusion models, and can we work around it?
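
To make the contrast concrete, here is a minimal sketch of the autoregressive case: the sequence log-probability is a sum of per-token conditional log-probabilities, so a single left-to-right pass yields the quantity that policy-gradient methods need. The function names and the two-token toy model are hypothetical and purely illustrative; they are not taken from any particular library or paper.

```python
import math

def sequence_log_prob(token_ids, next_token_probs):
    """Chain rule: log p(x) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, tok in enumerate(token_ids):
        # Hypothetical model call: distribution over the next token given the prefix.
        probs = next_token_probs(token_ids[:t])
        total += math.log(probs[tok])
    return total

def toy_next_token_probs(prefix):
    """Toy stand-in for a model (assumption): 2-token vocabulary that
    repeats the previous token with probability 0.75."""
    if not prefix:
        return [0.5, 0.5]
    return [0.75, 0.25] if prefix[-1] == 0 else [0.25, 0.75]

# log(0.5) + log(0.75) + log(0.25), one conditional term per token
lp = sequence_log_prob([0, 0, 1], toy_next_token_probs)
```

A diffusion language model has no single left-to-right ordering of this kind. Its sequence likelihood involves marginalizing over many possible denoising trajectories, which is why it is typically bounded or estimated by sampling rather than computed exactly.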

Can speech features be separated into semantic and stylistic components?

Linguistic theory suggests gestures decompose into semantic units and motion variations. Does this decomposition actually emerge in speech encoder layers, and can it enable more expressive gesture synthesis?

Can diffusion models enable control that autoregressive models cannot reach?

Autoregressive language models struggle with complex global controls such as syntactic constraints and infilling because they generate left to right and pass everything through a discrete-token bottleneck. Can diffusion models' continuous latents and parallel denoising overcome these structural limitations?

Can diffusion language models match autoregressive inference speed?

Can diffusion models commit to answers before full decoding?

Can diffusion models perform evolutionary search in parameter space?

Can reasoning and answers be generated separately in language models?

Can consistency models trade speed for quality with a few steps?

Can iterative revision cycles match how humans actually write?

Does autoregressive generation uniquely enable LLM scaling?