INQUIRING LINE

Can gradient-based control reach properties that autoregressive methods cannot?

This explores whether generating text by tweaking a whole sequence at once with gradients (as diffusion models do) can hit targets that strict left-to-right, one-token-at-a-time generation structurally can't.


The clearest 'yes' in the corpus comes from Diffusion-LM, which represents text as continuous latent variables so that gradients flow across the entire sequence at once. That lets it steer global properties (syntax, semantics, length, infilling) on six fine-grained control tasks where plug-and-play methods bolted onto autoregressive models fail (see "Can diffusion models enable control that autoregressive models cannot reach?"). The mechanism matters: autoregressive generation commits to each token and can never take it back, so problems that require discarding a bad partial choice hit an architectural ceiling, not a model-quality one. Constraint satisfaction is the sharp example: solvers depend on retracting invalid partial assignments, a primitive that autoregressive transformers simply lack (see "Why does autoregressive generation fail at constraint satisfaction?"). Gradient-based, parallel denoising sidesteps that bottleneck by revising the whole sequence rather than emitting it irreversibly.
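As a rough illustration of whole-sequence gradient steering, the toy sketch below nudges every position of a continuous latent sequence at once toward a global target while a penalty keeps it near the model's own estimate. Everything here is an assumption made for illustration: the function name `steer`, the scalar latents, the "sum of the sequence" control objective, and the quadratic fluency term stand in for the learned embeddings, trained classifier, and denoiser a real system such as Diffusion-LM would use.

```python
# Toy sketch only: scalar latents and closed-form gradients replace the
# learned embeddings and trained classifier of a real diffusion LM.

def steer(latents, prior_mean, target_sum, fluency_weight=0.1, lr=0.05, steps=200):
    """Gradient descent applied to all sequence positions simultaneously.

    Loss = (sum(x) - target_sum)^2                          # global control term
         + fluency_weight * sum((x_i - prior_mean_i)^2)     # stay near the model's estimate
    """
    x = list(latents)
    for _ in range(steps):
        # The control signal depends on the whole sequence, not on a prefix.
        gap = sum(x) - target_sum
        grads = [
            2.0 * gap + 2.0 * fluency_weight * (xi - mi)
            for xi, mi in zip(x, prior_mean)
        ]
        # Every position is revised in the same step; nothing is committed early.
        x = [xi - lr * g for xi, g in zip(x, grads)]
    return x


if __name__ == "__main__":
    start = [0.5, 0.5, 0.5, 0.5]
    steered = steer(start, prior_mean=start, target_sum=4.0)
    print(steered, sum(steered))
```

The point of the sketch is structural: the gradient of a sequence-level objective reaches every position in each update, which is the kind of revision a left-to-right sampler cannot perform on tokens it has already emitted.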

The deeper claim is that autoregressive factorization isn't load-bearing for the things we credit it with. LLaDA shows that non-autoregressive diffusion models match autoregressive scaling, suggesting that LLM scalability comes from transformers, data, and Fisher-consistent training, not from left-to-right generation per se (see "Does autoregressive generation uniquely enable LLM scaling?"). If autoregression is contingent rather than necessary, then the controllability gap is a real, distinct payoff: diffusion buys you global gradient steering without giving up scale.

But 'reaches what AR cannot' cuts both ways, and this is the part a reader might not expect. The same parallel structure that unlocks gradient control breaks the tools built for autoregressive models. Reinforcement learning methods like GRPO and DPO rely on a clean log-likelihood factorization over a token sequence; diffusion's non-sequential denoising makes that likelihood intractable, because you'd have to marginalize over denoising trajectories (see "Why can't we easily adapt reinforcement learning to diffusion language models?"). So the property frontier isn't strictly 'diffusion ⊇ autoregressive.' Each paradigm reaches control surfaces the other can't easily touch: gradients for global properties on one side, mature likelihood-based RL post-training on the other.
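The factorization point can be written out in standard form. The notation is assumed here rather than taken from the notes: x^1, ..., x^N are the tokens of a sequence under an autoregressive model, while x_0 is the final text and x_1, ..., x_T are the intermediate denoising states of a diffusion model (for discrete or masked diffusion the integral becomes a sum).

```latex
% Autoregressive: the sequence log-likelihood is an exact sum of per-token terms
\log p_\theta(x) = \sum_{i=1}^{N} \log p_\theta\left(x^{i} \mid x^{<i}\right)

% Diffusion: the likelihood of the final text marginalizes over denoising trajectories
p_\theta(x_0) = \int p(x_T) \prod_{t=1}^{T} p_\theta\left(x_{t-1} \mid x_t\right) \, dx_{1:T}
```

Methods such as GRPO and DPO are built around log-probabilities, or ratios of them, for sampled sequences. The first expression supplies those directly from per-token terms; the second has no closed form in general, so it has to be bounded or approximated before those objectives can be applied.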

That trade frames why gradient-based control is attractive even inside the autoregressive world, where it shows up as a data and parameter lever rather than a generation method. Gradient-similarity selection (LESS) uses low-rank gradient features to pick the 5% of instruction data closest to a target capability, and training on that subset beats full-dataset training (see "Can we train better models on less data?"). And RL itself turns out to update only a sparse subnetwork, roughly 5–30% of parameters, with updates that are nearly full-rank and nearly identical across random seeds: structured parameter selection rather than diffuse change (see "Does reinforcement learning update only a small fraction of parameters?"). Both hint that gradient signals carry precise, targetable structure; diffusion just exposes that structure at generation time instead of only at training time.
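The selection step can be sketched in a few lines, assuming the gradient features have already been computed and reduced to short vectors. This is a simplified illustration rather than the LESS pipeline itself: the helper names `cosine` and `select_top_fraction` are hypothetical, a single target vector stands in for the target task's gradients, and details such as how the features are produced or aggregated are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_fraction(train_feats, target_feat, fraction=0.05):
    """Return indices of the training examples whose gradient features
    point most nearly in the same direction as the target feature."""
    k = max(1, int(len(train_feats) * fraction))  # keep at least one example
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: cosine(train_feats[i], target_feat),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical 2-d features: only the last example aligns with the target.
    feats = [[-1.0, 0.0]] * 19 + [[1.0, 0.1]]
    print(select_top_fraction(feats, [1.0, 0.0]))
```

Ranking by direction rather than magnitude is a design choice in this sketch: it keeps long or high-loss examples from dominating the selection just because their gradients are large.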

The honest answer: yes, gradient-based control reaches global text properties that autoregressive generation structurally cannot, because it replaces irreversible token commitment with whole-sequence revision. The catch is that you trade away the very factorization that makes autoregressive RL tractable — so the open frontier is hybrid: keeping gradient controllability while recovering preference optimization and reward shaping for non-sequential models.


Sources (6 notes)

Can diffusion models enable control that autoregressive models cannot reach?

Diffusion-LM succeeds on six fine-grained control tasks (syntax, semantics, infilling, length) where plug-and-play methods fail. Its continuous latent variables allow gradients to flow across the entire sequence simultaneously, replacing the discrete-token bottleneck and enabling parallel denoising.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Does autoregressive generation uniquely enable LLM scaling?

LLaDA demonstrates that non-autoregressive diffusion models match autoregressive scaling performance. This suggests scalability emerges from the interplay of architecture, dataset size, and Fisher-consistent principles—meaning autoregressive factorization is contingent rather than necessary.

Why can't we easily adapt reinforcement learning to diffusion language models?

Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
