SYNTHESIS NOTE

Can consistency models trade speed for quality with a few steps?

Consistency models sample in one step but sacrifice quality compared to diffusion. Can adding just a handful of sampling steps close the quality gap while staying faster than full diffusion?

Synthesis note · 2026-06-03 · sourced from Diffusion LLM

Diffusion models are easy to train but slow to sample because they need many function evaluations; consistency models sample in a single step but are hard to train and sacrifice quality. Multistep Consistency Models unify Consistency Models and TRACT under a single dial, the number of sampling steps: a 1-step model is a conventional consistency model, while an ∞-step model is a diffusion model, so the method interpolates between the two. The practical finding is that a small increase in the step budget (from 1 to 2–8 steps) makes models much easier to train and yields higher-quality samples while retaining most of the single-step speed advantage. The quality gap to standard diffusion closes in as few as 8 steps, and the approach scales to text-to-image.
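The step dial can be made concrete with a minimal sampling sketch. It assumes a cosine variance-preserving noise schedule and a deterministic DDIM move between segment boundaries; neither detail is stated in this note. The names `alpha_sigma`, `ddim_step`, `multistep_sample`, and `toy_denoiser` are hypothetical, and `toy_denoiser` is a closed-form stand-in for a trained network, not a real consistency model.

```python
import numpy as np

def alpha_sigma(t):
    # Assumed cosine variance-preserving schedule: alpha^2 + sigma^2 = 1.
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ddim_step(x_hat, z_t, t, s):
    # Deterministic move from time t to s < t, given a clean-sample estimate.
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t  # implied noise estimate
    return alpha_s * x_hat + sigma_s * eps_hat

def multistep_sample(f, shape, steps, rng):
    # One call to f per segment, so the cost is exactly `steps` evaluations.
    z = rng.standard_normal(shape)
    for i in range(steps, 0, -1):
        t, s = i / steps, (i - 1) / steps
        x_hat = f(z, t)                # model's estimate of the clean sample
        z = ddim_step(x_hat, z, t, s)  # at s = 0 this returns x_hat itself
    return z

def toy_denoiser(z, t, mu=2.0, std=0.5):
    # Hypothetical stand-in: posterior mean for 1-D Gaussian data N(mu, std^2).
    alpha, sigma = alpha_sigma(t)
    gain = alpha * std**2 / (alpha**2 * std**2 + sigma**2)
    return mu + gain * (z - alpha * mu)

# Usage: the same sampler covers the one-step and few-step regimes.
one_step = multistep_sample(toy_denoiser, (5,), steps=1, rng=np.random.default_rng(0))
four_step = multistep_sample(toy_denoiser, (5,), steps=4, rng=np.random.default_rng(0))
```

With `steps=1` the loop collapses to a single call that maps noise straight to a sample, and raising `steps` spends more evaluations on shorter segments, which is the dial described above.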

The key takeaway is the framing of sampling steps as a continuous quality-speed trade-off rather than a binary choice between "fast but worse" (consistency) and "slow but best" (diffusion). The hard single-step regime was the wrong target; a handful of steps recovers most of the quality while keeping most of the speed.

This sits in the vault's diffusion thread as a sampling-efficiency contribution. It pairs with "Can generating entire videos at once beat keyframe interpolation?" (Lumiere) as another rethinking of the diffusion generation budget, and the speed-quality dial mirrors the test-time-compute trade-offs seen on the language side.

Easy Consistency Tuning makes the diffusion→consistency conversion cheap (ECT, https://arxiv.org/abs/2406.14548). The same "diffusion is a special case of consistency" view powers a training-efficiency result: rather than training a consistency model from scratch (a week on 8 GPUs as of 2024), ECT fine-tunes a pretrained diffusion model and progressively approximates the full consistency condition over training. It reaches a 2-step FID of 2.73 on CIFAR-10 in about 1 hour on a single A100, matching Consistency Distillation runs that took hundreds of GPU-hours, and the resulting consistency models obey classic power-law scaling, suggesting they improve with scale. So Multistep Consistency dials inference steps along the diffusion↔consistency continuum, while ECT exploits the same continuum to make training the consistency model cheap by starting from diffusion. (One limitation of ECT: it needs the dataset, unlike data-free distillation.)
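To illustrate what "progressively approximates the full consistency condition" can mean, here is a minimal sketch of a curriculum over paired noise levels. It assumes training compares the model's output at a noise level t with its output at a lower level r on the same trajectory; when r = 0 the target is the clean data (the diffusion case), and as r approaches t the constraint approaches the full consistency condition. The halving schedule, the `halve_every` parameter, and the function names are illustrative assumptions, not the schedule ECT actually uses.

```python
def r_over_t(iteration, halve_every=1000):
    # Hypothetical curriculum: the gap (1 - r/t) halves every `halve_every` steps.
    stage = iteration // halve_every
    return 1.0 - 1.0 / (2 ** stage)

def paired_levels(t, iteration, halve_every=1000):
    # Returns (t, r) with r <= t; r == 0 at the start recovers the diffusion case.
    return t, t * r_over_t(iteration, halve_every)

# Usage: inspect how the pairing tightens over training.
for it in (0, 1000, 2000, 5000):
    print(it, paired_levels(1.0, it))
```

Starting at r = 0 is what lets a pretrained diffusion model serve as the initialization: at iteration 0 the objective in this sketch coincides with ordinary denoising, and the constraint only gradually moves toward consistency.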

Inquiring lines that read this note 2

What makes weaker teacher models effective for stronger student training?

Related concepts in this collection 2

Can generating entire videos at once beat keyframe interpolation?
Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling gaps, produce more globally coherent motion? This explores whether pipeline decomposition fundamentally limits motion consistency.
sibling diffusion-budget rethink on the video side

How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
the steps-vs-quality dial echoes the test-time compute trade-off on the language side
