Can consistency models trade speed for quality with a few steps?
Consistency models sample in one step but sacrifice quality compared to diffusion. Can adding just a handful of sampling steps recover the quality gap while staying faster than full diffusion?
Diffusion models are easy to train but slow to sample because they need many function evaluations; consistency models sample in a single step but are hard to train and give up quality. Multistep Consistency Models unify Consistency Models and TRACT behind a single dial, the number of sampling steps: a 1-step model is a conventional consistency model, while an ∞-step model is a diffusion model, so the method interpolates between the two. The practical finding is that a small budget increase (from 1 step to 2–8 steps) makes models much easier to train and yields higher-quality samples while retaining most of the single-step speed advantage. The quality gap to standard diffusion closes in as few as 8 steps, and the approach scales to text-to-image.
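The step-count dial can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the cosine schedule, the function names (`alpha_sigma`, `multistep_sample`, `gaussian_posterior_mean`) and the toy data distribution are all assumptions, and a plain deterministic DDIM update stands in for whatever sampler the paper uses. The toy denoiser is the exact posterior mean for 1-D Gaussian data, i.e. a diffusion-style predictor rather than a trained consistency model, so it only shows the diffusion end of the dial: one step collapses every sample onto the data mean, and more steps recover the data spread.

```python
import numpy as np

def alpha_sigma(t):
    """Variance-preserving cosine schedule (an assumption): alpha^2 + sigma^2 = 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def multistep_sample(denoise_fn, shape, steps, rng):
    """Sample with `steps` evaluations of denoise_fn, moving from t=1 (noise) to t=0."""
    z = rng.standard_normal(shape)             # z at t=1 is pure noise
    times = np.linspace(1.0, 0.0, steps + 1)   # boundaries of `steps` equal segments
    for t, s in zip(times[:-1], times[1:]):
        a_t, s_t = alpha_sigma(t)
        a_s, s_s = alpha_sigma(s)
        x_hat = denoise_fn(z, t)               # one function evaluation: predicted clean sample
        eps_hat = (z - a_t * x_hat) / s_t      # noise implied by that prediction
        z = a_s * x_hat + s_s * eps_hat        # deterministic DDIM move from t to s
    return z

def gaussian_posterior_mean(mu, data_std):
    """Hypothetical toy denoiser: E[x | z_t] when the data is N(mu, data_std^2)."""
    def denoise_fn(z, t):
        a_t, s_t = alpha_sigma(t)
        gain = a_t * data_std**2 / (a_t**2 * data_std**2 + s_t**2)
        return mu + gain * (z - a_t * mu)
    return denoise_fn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    denoiser = gaussian_posterior_mean(mu=2.0, data_std=0.5)
    for steps in (1, 2, 8, 64):
        samples = multistep_sample(denoiser, (20000,), steps, rng)
        print(steps, samples.mean(), samples.std())
```

Running the script prints the sample mean and standard deviation per step budget; the spread grows toward `data_std` as `steps` increases. A multistep consistency model is trained so that each of its few segments is already accurate, which is why it would not need the large step counts this diffusion-style toy does.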
The keeper is the framing of sampling steps as a graded quality-speed trade-off rather than a binary choice between "fast but worse" (consistency) and "slow but best" (diffusion). The hard single-step regime was the wrong target; a handful of steps recovers most of the quality while keeping most of the speed.
This sits in the vault's diffusion thread as a sampling-efficiency contribution. It pairs with "Can generating entire videos at once beat keyframe interpolation?" (Lumiere) as another rethinking of the diffusion generation budget, and the speed-quality dial mirrors the test-time-compute trade-offs seen on the language side.
Easy Consistency Tuning makes the diffusion→consistency conversion cheap (ECT, https://arxiv.org/abs/2406.14548). The same "diffusion is a special case of consistency" view powers a training-efficiency result: rather than training a consistency model from scratch (a week on 8 GPUs as of 2024), ECT fine-tunes a pretrained diffusion model and progressively approximates the full consistency condition over training. It reaches a 2-step FID of 2.73 on CIFAR-10 in ~1 hour on a single A100, matching Consistency Distillation runs that took hundreds of GPU-hours. The resulting consistency models obey classic power-law scaling, suggesting they improve with scale. So Multistep Consistency dials inference steps along the diffusion↔consistency continuum, while ECT exploits the same continuum to make training the consistency model cheap by starting from diffusion. (ECT's one limitation: it needs the dataset, unlike data-free distillation.)
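One way to write down the continuum ECT exploits is shown below. This is a sketch under assumed notation rather than a quotation from the paper: $f_\theta$ is the model, $\theta^-$ a stop-gradient copy of the weights, $x_t$ and $x_r$ two noisy versions of the same sample $x_0$ at noise levels $r < t$, and $d$ some distance.

```latex
% Consistency condition: every point on one trajectory maps to the same output,
% anchored by the boundary condition at t = 0.
\begin{align}
  f_\theta(x_t, t) &= f_\theta(x_r, r) \quad \text{for all } 0 \le r < t, \\
  f_\theta(x_0, 0) &= x_0 .
\end{align}

% Training loss on a pair (t, r):
\begin{equation}
  \mathcal{L}(\theta) = \mathbb{E}\!\left[ d\!\left( f_\theta(x_t, t),\; f_{\theta^-}(x_r, r) \right) \right]
\end{equation}

% r = 0:      the target is f_{\theta^-}(x_0, 0) = x_0, so the loss is a denoising (diffusion) loss.
% r \to t:    the loss enforces the full consistency condition.
```

Read this way, starting from a pretrained diffusion model corresponds to the $r = 0$ end, and moving $r$ toward $t$ during training is what "progressively approximates the full consistency condition" means.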
Related concepts in this collection 2
- Can generating entire videos at once beat keyframe interpolation?
  Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling gaps, produce more globally coherent motion? This explores whether pipeline decomposition fundamentally limits motion consistency.
  sibling diffusion-budget rethink on the video side
- How should we balance parallel versus sequential compute at test time?
  Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
  the steps-vs-quality dial echoes the test-time compute trade-off on the language side
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Multistep Consistency Models
- Consistency Models Made Easy
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Consistency Training Helps Stop Sycophancy and Jailbreaks
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
- Progressive-Hint Prompting Improves Reasoning in Large Language Models
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
- Self-consistency Improves Chain Of Thought Reasoning In Language Models
Original note title
multistep consistency models interpolate between one-step consistency and many-step diffusion to trade sampling speed for quality