SYNTHESIS NOTE

Can consistency models trade speed for quality with a few steps?

Consistency models sample in one step but sacrifice quality compared to diffusion. Can adding just a handful of sampling steps recover the quality gap while staying faster than full diffusion?

Synthesis note · 2026-06-03 · sourced from Diffusion LLM

Diffusion models are easy to train but slow to sample (many function evaluations); consistency models sample in a single step but are hard to train and sacrifice quality. Multistep Consistency Models unify Consistency Models and TRACT into a single dial: a 1-step model is a conventional consistency model, while an ∞-step model is a diffusion model — so the method interpolates between the two. The practical finding is that a small budget increase (1 → 2–8 steps) makes models much easier to train and yields higher-quality samples while retaining most of the single-step speed advantage — closing the quality gap to standard diffusion in as few as 8 steps, and scaling to text-to-image.
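
The dial can be made concrete with a minimal sampling sketch. It assumes a variance-preserving cosine noise schedule and a deterministic DDIM-style update between steps; `alpha_sigma`, `multistep_sample`, and the toy `point_mass_denoiser` are hypothetical names for illustration, not the paper's code. With `steps=1` the loop reduces to a single network call (the consistency-model case), and larger `steps` values move toward diffusion-style sampling.

```python
import numpy as np


def alpha_sigma(t):
    """Cosine variance-preserving schedule (an assumption for this sketch)."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)


def multistep_sample(denoise, shape, steps, rng):
    """Alternate clean-data prediction with a DDIM-style step to less noise.

    `denoise(z, t)` stands in for a trained network that predicts clean data
    from a noisy input z at noise level t. Cost is `steps` network calls.
    """
    ts = np.linspace(1.0, 0.0, steps + 1)   # equally spaced boundaries, 1 -> 0
    z = rng.standard_normal(shape)           # start from pure noise at t = 1
    for t, s in zip(ts[:-1], ts[1:]):
        x_hat = denoise(z, t)                # one function evaluation
        alpha_t, sigma_t = alpha_sigma(t)
        alpha_s, sigma_s = alpha_sigma(s)
        eps_hat = (z - alpha_t * x_hat) / sigma_t   # noise implied by x_hat
        z = alpha_s * x_hat + sigma_s * eps_hat     # re-noise to level s
    return z


def point_mass_denoiser(z, t):
    """Toy stand-in: the ideal denoiser when all data equals 3.0."""
    return np.full_like(z, 3.0)


rng = np.random.default_rng(0)
one_step = multistep_sample(point_mass_denoiser, (4,), steps=1, rng=rng)
eight_step = multistep_sample(point_mass_denoiser, (4,), steps=8, rng=rng)
```

The toy denoiser only checks the plumbing: because the final step lands on t = 0, where the noise weight is zero, the sample equals the last clean-data prediction regardless of the step count. Quality differences between 1 and 8 steps appear only with a real, imperfect network.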

The keeper is the framing of sampling steps as a continuous quality-speed trade-off rather than a binary choice between "fast but worse" (consistency) and "slow but best" (diffusion). The hard single-step regime was the wrong target; a handful of steps recovers most of the quality while keeping most of the speed advantage.

This sits in the vault's diffusion thread as a sampling-efficiency contribution. It pairs with "Can generating entire videos at once beat keyframe interpolation?" (Lumiere) as another rethinking of the diffusion generation budget, and the speed-quality dial mirrors the test-time-compute trade-offs seen on the language side.

Easy Consistency Tuning makes the diffusion→consistency conversion cheap (ECT, https://arxiv.org/abs/2406.14548). The same "diffusion is a special case of consistency" view powers a training-efficiency result: rather than training a consistency model from scratch (a week on 8 GPUs as of 2024), ECT fine-tunes a pretrained diffusion model and progressively approximates the full consistency condition over training. It reaches a 2-step FID of 2.73 on CIFAR-10 in ~1 hour on a single A100 — matching Consistency Distillation that took hundreds of GPU-hours — and the resulting consistency models obey classic power-law scaling, suggesting they improve with scale. So Multistep Consistency dials inference steps along the diffusion↔consistency continuum; ECT exploits the same continuum to make training the consistency model cheap by starting from diffusion. (ECT's one limitation: it needs the dataset, unlike data-free distillation.)
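
The "progressively approximates the full consistency condition" idea can be sketched as a schedule for the gap between the two noise levels a training pair is compared at. The geometric halving below, along with the names `ratio_r_over_t`, `paired_times`, `stage_length`, and `q`, is an illustrative assumption and not ECT's published schedule, which may differ. The pattern it shows: at the start the lower level r is 0, so the target is clean data (the diffusion objective), and over training r approaches t, tightening toward the consistency condition.

```python
def ratio_r_over_t(iteration, stage_length, q=2.0):
    """Fraction r/t used at a given training iteration (hypothetical schedule).

    Stage 0 returns 0.0, so r = 0 and the model is trained against clean data,
    as in diffusion. Each later stage shrinks the gap t - r by a factor q.
    """
    stage = iteration // stage_length
    return 1.0 - 1.0 / (q ** stage)


def paired_times(t, iteration, stage_length, q=2.0):
    """Return (r, gap): the lower noise level paired with t, and t - r."""
    r = t * ratio_r_over_t(iteration, stage_length, q)
    return r, t - r


# Gap at noise level t = 1.0 as training proceeds through four stages.
gaps = [paired_times(1.0, it, stage_length=1000)[1] for it in (0, 1000, 2000, 3000)]
```

Starting at the diffusion end is what lets a pretrained diffusion model serve as the initialization: at stage 0 that model already satisfies the condition being trained.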

Original note title

multistep consistency models interpolate between one-step consistency and many-step diffusion to trade sampling speed for quality