Can a single model learn when to speak and respond?
Does combining perception, generation, and turn-taking into one streaming model let timing and interruption handling emerge naturally, rather than requiring separate engineered modules?
Cascaded conversational systems chain separate modules (VAD, ASR, language, TTS, animation, video generation) and pay for it twice: in pipeline latency and in errors that accumulate across handoffs. Wan-Streamer argues the cascade also makes the interactional problem unsolvable in principle, because deciding whether and when to respond, how to manage a turn, or how to absorb an interruption is not any one module's job. By representing language, audio, and video as one interleaved stream of input and output tokens under block-causal attention, it lets a single Transformer learn perception, generation, response timing, turn management, and cross-modal synchronization jointly, with streaming units as short as 160 ms at 25 fps and sub-second latency.
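To make "one interleaved stream under block-causal attention" concrete, here is a minimal sketch of such a mask. It is not Wan-Streamer's implementation: the layout (each streaming unit contributes a block of input tokens followed by a block of output tokens), the token counts, and the names `interleaved_blocks` and `block_causal_mask` are all assumptions made for illustration. The only figures taken from the note are 160 ms and 25 fps.

```python
from typing import List, Tuple


def frames_per_unit(unit_ms: int = 160, fps: int = 25) -> int:
    """Video frames spanned by one streaming unit (160 ms at 25 fps)."""
    return unit_ms * fps // 1000


def interleaved_blocks(num_units: int, n_in: int, n_out: int) -> List[Tuple[int, str]]:
    """Assumed layout: per unit, n_in input tokens then n_out output tokens.

    Each (unit, role) pair gets its own block id, so blocks alternate
    in, out, in, out, ... along the stream.
    """
    tokens: List[Tuple[int, str]] = []
    for unit in range(num_units):
        tokens += [(2 * unit, "in")] * n_in
        tokens += [(2 * unit + 1, "out")] * n_out
    return tokens


def block_causal_mask(block_ids: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    A token sees everything in its own block and in all earlier blocks,
    and nothing from later blocks.
    """
    return [[k <= q for k in block_ids] for q in block_ids]


tokens = interleaved_blocks(num_units=2, n_in=2, n_out=1)
mask = block_causal_mask([block for block, _ in tokens])

print("frames per unit:", frames_per_unit())
for (block, role), row in zip(tokens, mask):
    print(f"block {block} ({role:>3}):", "".join("x" if ok else "." for ok in row))
```

Under this assumed layout, the output tokens of a unit can attend to that unit's input and to the whole history, while new input tokens can attend to everything the model has already emitted. That shared history is what would let response timing be learned from the same sequence used for perception and generation.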
The load-bearing reframe: in a cascade, turn-taking has to be engineered (a VAD threshold, a silence timer); in a unified causal stream, timing is emergent behavior the model learns from the same sequence it uses to perceive and speak. Human interaction is full-duplex — we watch, listen, speak, and interrupt with overlap — and that overlap is precisely what module boundaries destroy.
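For contrast, the engineered side of that reframe can be sketched in a few lines. This is a generic energy-threshold endpointer with a silence timer, not any specific VAD library and not part of Wan-Streamer; the function name `end_of_turn_frame`, the threshold, and the energy values are hypothetical.

```python
from typing import Iterable, Optional


def end_of_turn_frame(
    energies: Iterable[float],
    threshold: float = 0.1,
    hangover_frames: int = 3,
) -> Optional[int]:
    """Return the frame index at which the silence timer fires, or None.

    A frame counts as speech when its energy exceeds `threshold`. The turn
    is declared over once `hangover_frames` consecutive non-speech frames
    follow some speech.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy > threshold:
            heard_speech = True
            silent_run = 0  # any speech frame resets the timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None


# Hypothetical per-frame energies: the speaker pauses, then resumes at frame 8.
energies = [0.0, 0.5, 0.6, 0.05, 0.7, 0.0, 0.0, 0.0, 0.4]
print(end_of_turn_frame(energies))
```

With these values the timer fires at frame 7, one frame before the speaker resumes, so a cascade built on this rule would take the turn mid-thought. Tuning `threshold` and `hangover_frames` trades such false endpoints against added response delay, which is the kind of hand-set decision the unified stream is meant to replace with learned behavior.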
This extends a thread the vault holds on collapsing the speech cascade. As "Can skipping transcription make voice assistants faster?" establishes, removing the ASR/TTS hops already buys a dramatic latency reduction; Wan-Streamer pushes the same logic one tier further, absorbing video and, crucially, turn control into the unified model rather than only the speech path. It also makes concrete a long-running design question on the architecture map: where "Why do AI conversations reliably break down after multiple turns?" asks why multi-turn interaction degrades, full-duplex streaming reframes the turn itself as a learned, continuous decision rather than a discrete handoff between modules.
The strongest counterargument is the classic monolith-versus-pipeline trade: a single end-to-end model sacrifices the modular debuggability, independent upgrade paths, and component-level guarantees that cascades give you, and v0.1 is an early system whose latency/quality numbers are demonstration-scale, not field-hardened. The unification may simply move error accumulation inside the model where it is harder to inspect.
Related concepts in this collection 2
- Can skipping transcription make voice assistants faster?
  Voice assistants traditionally convert speech to text before responding. Does eliminating that middle step reduce latency enough to matter for real-time conversation?
  extends: pushes cascade-collapse from the speech path to video and turn control in one unified model
- Can agents fail from weak memory control rather than missing knowledge?
  As multi-turn agent workflows grow longer, performance degrades, but is this due to insufficient context or poor memory management? This explores whether memory *control* is the real bottleneck.
  convergent-with: persistent dialogue/world state committed back into history is the streaming analogue of bounded committed state
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
- Proactive Conversational Agents with Inner Thoughts
- DiscussLLM: Teaching Large Language Models When to Speak
- LLMs Get Lost In Multi-Turn Conversation
- Efficient Streaming Language Models with Attention Sinks
- Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
- Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion
Original note title
full-duplex interaction wants a single streaming model not a cascade — turn-taking and timing become learnable behaviors once perception and generation share one causal stream