Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer models language, audio, and video as both input and output within a single Transformer. The sequence interleaves visual, audio, and text input tokens with visual, audio, and text output tokens, and block-causal attention coordinates them for incremental streaming. Unlike cascaded interactive systems that chain separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer learns perception, reasoning, generation, response timing, turn management, and cross-modal synchronization jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps.
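As a rough illustration of two ideas in the abstract, the sketch below computes how many video frames a 160 ms streaming unit spans at 25 fps and builds a block-causal attention mask in its usual sense: tokens attend freely within their own block and to all earlier blocks, but never to later ones. The function names and the per-block token counts are hypothetical and chosen only for the example; the abstract does not specify the actual token layout or mask construction.

```python
from typing import List


def frames_per_unit(unit_ms: int, fps: int) -> int:
    """Number of whole video frames spanned by one streaming unit."""
    return unit_ms * fps // 1000


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Build a block-causal mask.

    mask[i][j] is True when token i may attend to token j, i.e. when
    token j sits in the same block as token i or in an earlier block.
    block_sizes is an assumed layout: tokens per streaming unit.
    """
    block_of: List[int] = []
    for block_index, size in enumerate(block_sizes):
        block_of.extend([block_index] * size)
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(frames_per_unit(160, 25))  # 4 frames of 40 ms each
    # Two hypothetical streaming units with 2 and 3 tokens.
    for row in block_causal_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because every token in a block sees only that block and its predecessors, a new streaming unit can be appended and processed without revisiting attention for earlier units, which is what makes incremental streaming possible under this kind of mask.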
Introduction. Human interaction with the physical world is fundamentally streaming and full-duplex. People do not first finish perceiving, then reason in isolation, and only afterwards produce a response. Instead, they continuously watch, listen, speak, gesture, react, pause, and interrupt, with perception and expression overlapping at audio-visual timescales. Building artificial systems with the same interaction pattern is becoming increasingly important for embodied assistants, real-time digital humans, live broadcasting, interactive entertainment, and world models that can be explored or controlled online [14, 22, 39]. These applications require more than a model that can understand an image, generate a clip, or answer a text prompt. They require a real-time interactive foundation model: a model that continuously consumes audio-visual observations, maintains a persistent world and dialogue state, decides when and how to respond, and expresses that response through synchronized language, speech, and video with very low latency. Recent progress has advanced several pieces of this goal.
Discussion / Conclusion. We presented Wan-Streamer, a native-streaming, end-to-end foundation model for real-time full-duplex text, audio, and video interaction. Unlike cascaded systems that alternate among perception, language modeling, speech synthesis, and visual generation modules, Wan-Streamer represents user inputs and agent outputs across all modalities as one causal stream processed by a single Transformer. With fully causal audio and video VAEs, causal encoders and decoders, and a block-causal Transformer, the model can perceive current observations, generate synchronized audio-visual responses, emit each streaming unit, and commit the generated latents back into history with minimal delay. Together with the thinker-performer serving design, Wan-Streamer reaches sub-second interactive latency while preserving full-history context.
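The perceive, generate, emit, and commit cycle described above can be sketched as a simple loop over streaming units. This is a minimal sketch under stated assumptions, not the Wan-Streamer implementation: `stream`, `generate`, and `emit` are hypothetical names, the history is a plain list standing in for the model's cached latents, and the thinker-performer serving split is not modeled.

```python
from typing import Any, Callable, Iterable, List, Tuple

# (role, payload); the payload is a placeholder for real audio-visual latents.
Entry = Tuple[str, Any]


def stream(
    observations: Iterable[Any],
    generate: Callable[[List[Entry]], Any],
    emit: Callable[[Any], None],
) -> List[Entry]:
    """Run one perceive/generate/emit/commit step per streaming unit."""
    history: List[Entry] = []
    for observation in observations:
        history.append(("input", observation))  # perceive the current unit
        output = generate(history)              # condition on the full history
        emit(output)                            # send the unit out right away
        history.append(("output", output))      # commit generated latents
    return history


if __name__ == "__main__":
    emitted: List[Any] = []
    # Toy generator (an assumption): returns the current history length.
    final_history = stream(["unit-0", "unit-1"], len, emitted.append)
    print(emitted)
    print(final_history)
```

Committing each output back into the same history as the inputs is what lets later units depend on everything the model has already seen and said, in line with the single causal stream described in the text.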