SYNTHESIS NOTE

Can generating entire videos at once beat keyframe interpolation?

Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling gaps, produce more globally coherent motion? This note explores whether pipeline decomposition fundamentally limits motion consistency.

Synthesis note · 2026-06-03 · sourced from Multimodal

Text-to-video generation is harder than text-to-image for two reasons: motion is sensitive to error, and the added temporal dimension strains memory, compute, and data. The prevalent approach generates distant keyframes first, then fills the gaps with a cascade of temporal super-resolution models. Lumiere identifies an inherent limitation in this design: it cannot learn globally coherent motion, because the keyframe-then-interpolate pipeline never represents the whole temporal trajectory at once.
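A toy illustration of why sparse keyframes can lose global motion, offered as a sketch under stated assumptions rather than as Lumiere's method. The functions `true_position` and `keyframe_then_interpolate` are hypothetical names, the motion is a one-dimensional sine wave standing in for a periodic movement, and plain linear interpolation stands in for the learned temporal super-resolution models. When the keyframe stride matches the motion's period, every keyframe sees the same pose, so nothing downstream can recover the oscillation.

```python
import math


def true_position(t: float, period: float = 8.0) -> float:
    """Toy periodic motion: one full cycle every `period` frames (assumed)."""
    return math.sin(2 * math.pi * t / period)


def keyframe_then_interpolate(num_frames: int, stride: int) -> list[float]:
    """Observe the motion only at keyframes, then fill the gaps linearly.

    Linear interpolation is a stand-in (assumption) for a learned
    temporal super-resolution stage.
    """
    frames = []
    for t in range(num_frames):
        left = (t // stride) * stride               # keyframe at or before t
        right = min(left + stride, num_frames - 1)  # next keyframe, clamped
        if right == left:
            frames.append(true_position(left))
            continue
        w = (t - left) / (right - left)
        frames.append((1 - w) * true_position(left) + w * true_position(right))
    return frames


# Full trajectory, as a single-pass generator would have to represent it.
full = [true_position(t) for t in range(33)]

# Keyframes every 8 frames: the stride equals the period, so the motion aliases.
interp = keyframe_then_interpolate(num_frames=33, stride=8)

max_error = max(abs(a - b) for a, b in zip(full, interp))
print(f"max error vs. full trajectory: {max_error:.3f}")
```

In this setup the interpolated clip is essentially static while the true motion swings through its full amplitude. A learned interpolator can add plausible local detail, but it only sees a window between keyframes, which is the limitation the note describes.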

Lumiere's response is architectural: a Space-Time U-Net that generates the entire temporal duration of the video in a single pass, using both spatial and temporal down- and up-sampling modules. Generating the full clip at once, rather than stitching independently generated keyframes, is what produces coherent motion, and the approach generalizes to image-to-video, inpainting, and stylized generation.
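To make the idea of joint spatial and temporal down- and up-sampling concrete, here is a minimal shape-arithmetic sketch. It is not Lumiere's actual configuration: the names `VideoShape`, `space_time_pyramid`, and `decoder_shapes`, the factor of 2 per level, the number of levels, and the input size are all illustrative assumptions. It only shows that the whole clip is compressed along time as well as space, then restored to its full duration.

```python
from typing import NamedTuple


class VideoShape(NamedTuple):
    frames: int
    height: int
    width: int


def space_time_pyramid(shape: VideoShape, levels: int,
                       t_factor: int = 2, s_factor: int = 2) -> list[VideoShape]:
    """Shapes at each encoder level when time AND space are down-sampled.

    Factors and level count are hypothetical; integer division assumes the
    input dimensions are divisible by the factors at every level.
    """
    shapes = [shape]
    for _ in range(levels):
        prev = shapes[-1]
        shapes.append(VideoShape(prev.frames // t_factor,
                                 prev.height // s_factor,
                                 prev.width // s_factor))
    return shapes


def decoder_shapes(encoder: list[VideoShape]) -> list[VideoShape]:
    """The up-sampling path mirrors the encoder back to the full clip."""
    return list(reversed(encoder))


# Illustrative input: an 80-frame clip at 128x128 (assumed values).
encoder = space_time_pyramid(VideoShape(80, 128, 128), levels=2)
decoder = decoder_shapes(encoder)

for s in encoder + decoder[1:]:
    print(s)
```

The coarsest level still spans the entire clip, only at a lower frame rate and resolution, so the network can reason about the full trajectory there. A keyframe cascade, by contrast, hands each later stage only a local window.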

The transferable keeper is a generation principle: coherence is a property of generating the whole at once, not of stitching locally coherent pieces. Cascades that decompose a globally structured output into independently generated fragments lose the global structure exactly where it matters (motion, here). This rhymes with the note "Can iterative revision cycles match how humans actually write?": both treat a long structured artifact as something to denoise as a whole rather than assemble piecewise.

Inquiring lines that read this note 2

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What articulatory information do speech signals carry that text cannot?

Why do cascade pipelines fail to capture global motion structure?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How does single-pass generation differ from multi-stage synthesis architecturally?

Related concepts in this collection 1

Can iterative revision cycles match how humans actually write? Does framing research writing as a diffusion process—where drafts are refined through retrieval-augmented cycles—better capture human cognition than linear pipelines and reduce information loss?
shared principle: generate/denoise the whole structured artifact rather than assemble locally-coherent fragments
