SYNTHESIS NOTE
Model Architecture and Internals

Can generating entire videos at once beat keyframe interpolation?

Does synthesizing a video's full temporal duration in a single pass, rather than generating keyframes and filling the gaps, produce more globally coherent motion? This note explores whether decomposing generation into a pipeline fundamentally limits motion consistency.

Synthesis note · 2026-06-03 · sourced from Multimodal

Text-to-video generation is harder than text-to-image for two reasons: motion is unforgiving of errors, and the added temporal dimension strains memory, compute, and training data. The prevalent approach generates distant keyframes first, then fills the gaps with a cascade of temporal super-resolution models. Lumiere identifies an inherent limitation in this design: the keyframe-then-interpolate pipeline never represents the whole temporal trajectory at once, so it cannot learn globally coherent motion.
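A toy illustration of why sparse keyframes can lose motion, offered as a sketch under loud assumptions rather than as anything from Lumiere: the "video" is a single object position oscillating over time, and the gap-filling step is plain linear interpolation standing in for a learned temporal super-resolution model. The function names and the period and stride values are hypothetical. When the keyframe stride matches the motion period, every keyframe sees the same position, and no local interpolation can recover the oscillation.

```python
import math

def true_position(t, period=8):
    """Hypothetical ground-truth motion: one oscillation every `period` frames."""
    return math.sin(2 * math.pi * t / period)

def keyframe_then_interpolate(num_frames, stride, period=8):
    """Sample keyframes every `stride` frames, then fill gaps linearly.

    Linear interpolation is a stand-in (assumption) for a learned
    temporal super-resolution stage.
    """
    keys = list(range(0, num_frames, stride))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)  # always keep the final frame as a keyframe
    key_vals = {k: true_position(k, period) for k in keys}

    frames = []
    for t in range(num_frames):
        lo = max(k for k in keys if k <= t)  # nearest keyframe at or before t
        hi = min(k for k in keys if k >= t)  # nearest keyframe at or after t
        if lo == hi:
            frames.append(key_vals[lo])
        else:
            w = (t - lo) / (hi - lo)
            frames.append((1 - w) * key_vals[lo] + w * key_vals[hi])
    return frames

def max_error(num_frames=33, stride=8, period=8):
    """Largest per-frame gap between the true motion and the reconstruction."""
    truth = [true_position(t, period) for t in range(num_frames)]
    recon = keyframe_then_interpolate(num_frames, stride, period)
    return max(abs(a - b) for a, b in zip(truth, recon))

if __name__ == "__main__":
    print("stride 8:", max_error(stride=8))  # keyframes alias the motion away
    print("stride 1:", max_error(stride=1))  # every frame is a keyframe
```

With a stride of 8 the reconstruction is essentially a still object, even though each interpolated segment is locally smooth. A learned interpolator can do better than a straight line, but it still has to guess motion that its inputs do not contain.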

Lumiere's response is architectural: a Space-Time U-Net that generates the entire temporal duration of the video in a single pass, using both spatial and temporal down- and up-sampling modules. Generating the full clip at once, rather than stitching independently generated keyframes, is what produces coherent motion, and the same approach extends to image-to-video, inpainting, and stylized generation.
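To make "spatial and temporal down- and up-sampling" concrete, here is a minimal shape-tracking sketch. It only does the arithmetic on a (frames, height, width) shape; it is not Lumiere's network. The function name, the number of levels, the factor of 2 per axis, and the clip size in the usage line are all illustrative assumptions, not values from this note.

```python
def space_time_shapes(frames, height, width, levels=2, t_factor=2, s_factor=2):
    """Trace (T, H, W) through a hypothetical space-time U-Net.

    Each encoder level shrinks time by `t_factor` and space by `s_factor`;
    the decoder mirrors the encoder back to the input shape.
    """
    encoder = [(frames, height, width)]
    for _ in range(levels):
        t, h, w = encoder[-1]
        encoder.append((t // t_factor, h // s_factor, w // s_factor))
    decoder = encoder[-2::-1]  # mirror of the encoder, excluding the bottleneck
    return encoder + decoder

if __name__ == "__main__":
    for shape in space_time_shapes(80, 128, 128):
        print(shape)
```

The point of the sketch is that every level, including the coarsest, still spans the whole clip in time. A keyframe pipeline, by contrast, hands each interpolation stage only a short window between neighbouring keyframes.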

The transferable takeaway is a generation principle: coherence is a property of generating the whole at once, not of stitching locally coherent pieces. Cascades that decompose a globally structured output into independently generated fragments lose the global structure exactly where it matters (here, motion). This rhymes with the note "Can iterative revision cycles match how humans actually write?": both treat a long structured artifact as something to denoise as a whole rather than assemble piecewise.

Original note title

generating a video's full duration in a single space-time pass beats keyframe-plus-interpolation for globally coherent motion