INQUIRING LINE

Does direct speech-to-speech generation really eliminate transcription latency?

This explores whether generating speech directly from speech input actually removes the delay that comes from transcribing audio to text first — and what the corpus says is really being saved (and what's being traded away).


The short answer from the corpus is yes: skipping transcription removes a real and specific cost, and the reason is more interesting than "fewer steps." The classic voice pipeline is a cascade: speech goes to a recognizer, the recognizer emits text, a language model reads the text, and only then does generation begin. Each stage has to wait for the previous one to finish a unit before it can act. The direct approach (Can skipping transcription make voice assistants faster?) brings response latency down to roughly 226 ms, the figure reported for LLaMA-Omni, not merely by deleting a box but because speech embeddings carry acoustic information the model can start acting on before the full utterance arrives. Text is a lossy, late-arriving intermediate, and removing it lets generation begin earlier.
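The additive nature of cascade delay can be sketched in a few lines. This is a toy model only: the `Stage` class, the function name, and the three cascade stage timings are hypothetical values chosen for illustration, not measurements of any system. The only number taken from the sources is the 226 ms reported for LLaMA-Omni.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: float  # time this stage needs before it emits anything


def time_to_first_response_ms(stages: List[Stage]) -> float:
    """In a cascade each stage waits on the previous one, so the delays add."""
    return sum(stage.first_output_ms for stage in stages)


# ASSUMPTION: illustrative cascade timings, not benchmarks.
cascade = [
    Stage("recognizer", 300.0),
    Stage("language model", 200.0),
    Stage("speech synthesis", 150.0),
]

# The single-model figure is the 226 ms the source note reports for LLaMA-Omni.
direct = [Stage("speech-to-speech model", 226.0)]

cascade_ms = time_to_first_response_ms(cascade)
direct_ms = time_to_first_response_ms(direct)
handoff_savings_ms = cascade_ms - direct_ms

print(f"cascade: {cascade_ms} ms, direct: {direct_ms} ms, "
      f"difference: {handoff_savings_ms} ms")
```

Under these assumed numbers the gap comes entirely from summing per-stage waits, which is the point of the paragraph above: the saving is about hand-offs rather than about any one slow component.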

But "eliminate" is too clean a word. What the direct model really does is trade one cost for another. The transcription step wasn't only slow — it was also a place where errors got corrected and intent got resolved. Real-world recognizers run at 15–30% error rates in noisy settings, which is exactly why traditional dialogue systems leaned on probabilistic belief-tracking rather than trusting a single transcript (Why do dialogue systems need probabilistic reasoning?). Skip transcription and you also skip that explicit error-handling layer; the burden of coping with ambiguous, messy audio moves inside the model instead of disappearing.
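To make the belief-tracking idea concrete, here is a minimal sketch of a Bayesian belief update over user intents. It is a simplification of what POMDP-based dialogue systems do: there is no transition model or action selection, and the recognizer's error rate is applied at the intent level rather than the word level. The intent names, the `update_belief` function, and the 20% error rate (picked from inside the 15-30% range in the sources) are assumptions for illustration.

```python
from typing import Dict


def update_belief(belief: Dict[str, float], observed: str,
                  error_rate: float) -> Dict[str, float]:
    """Bayes update: weigh each intent by how likely it is to yield `observed`.

    ASSUMPTION: when the recognizer errs, the mistake is spread uniformly
    over the other intents.
    """
    n = len(belief)
    if n < 2:
        raise ValueError("need at least two candidate intents")
    posterior = {}
    for intent, prior in belief.items():
        if intent == observed:
            likelihood = 1.0 - error_rate
        else:
            likelihood = error_rate / (n - 1)
        posterior[intent] = prior * likelihood
    total = sum(posterior.values())
    return {intent: p / total for intent, p in posterior.items()}


# Hypothetical intents with a uniform prior.
intents = ["book_flight", "cancel_flight", "check_status"]
belief = {intent: 1.0 / len(intents) for intent in intents}

# Three noisy observations; the second one is a misrecognition.
for heard in ["book_flight", "cancel_flight", "book_flight"]:
    belief = update_belief(belief, heard, error_rate=0.2)

best_intent = max(belief, key=belief.get)
print(best_intent, {k: round(v, 3) for k, v in belief.items()})
```

The system never commits to a single transcript: one misheard turn lowers confidence without overturning the accumulated evidence. This is the explicit layer a direct speech-to-speech model has to reproduce internally.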

The deeper reframing is that the win isn't "no transcription" — it's "no hand-offs." The most aggressive version of this idea (Can a single model learn when to speak and respond?) treats language, audio, and video as one interleaved token stream so that even turn-taking — knowing when to speak — becomes learned behavior inside a single model rather than a separately engineered module. The latency savings there come from the same source: every boundary between specialized components is a place where one stage waits on another, and unifying them removes the waiting, not just the transcribing.
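The "one interleaved token stream" idea can be illustrated by merging per-modality token sequences into a single time-ordered sequence. This is a sketch under assumptions: the `Token` type, the timestamps, and the token values are hypothetical, and the sources do not describe how Wan-Streamer actually tokenizes or orders its inputs.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    time_ms: int
    modality: str
    value: str


def interleave(*streams: Iterable[Token]) -> List[Token]:
    """Merge already time-sorted streams into one causal, time-ordered stream."""
    return list(heapq.merge(*streams, key=lambda token: token.time_ms))


# ASSUMPTION: toy tokens; each per-modality list is already sorted by time.
audio = [Token(0, "audio", "a0"), Token(40, "audio", "a1"), Token(80, "audio", "a2")]
video = [Token(20, "video", "v0"), Token(60, "video", "v1")]
text = [Token(50, "text", "hi")]

stream = interleave(audio, video, text)
for token in stream:
    print(token.time_ms, token.modality, token.value)
```

With a single sequence like this, a model that predicts the next token sees every modality in arrival order, so the decision to start speaking can in principle be one more predicted token instead of a signal passed between modules.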

There's also a reason this works acoustically and not just architecturally. Self-supervised speech models appear to learn the language-agnostic physics of how a vocal tract produces sound rather than discrete phonetic categories (Do speech models learn language-specific sounds or universal physics?). That continuous, sub-symbolic representation is what makes it viable to generate from speech directly — there's rich structure to act on without first quantizing everything into words.

So the honest answer: direct speech-to-speech does eliminate transcription latency specifically, and the gain is real because text was both a delay and a loss of information. But it doesn't eliminate the *work* transcription was quietly doing — error correction and intent disambiguation — it just internalizes it. The latency you save is the cost of waiting for hand-offs between modules; the risk you inherit is that the explicit safety net for noisy, misheard speech is gone, and the model now has to absorb that uncertainty on its own.


Sources (4 notes)

Can skipping transcription make voice assistants faster?

LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.

Why do dialogue systems need probabilistic reasoning?

Real-world speech recognition achieves 15-30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.

Can a single model learn when to speak and respond?

Wan-Streamer represents language, audio, and video as one interleaved causal token stream, allowing response timing and turn management to be learned jointly within a single Transformer rather than engineered as separate modules, achieving sub-second latency.

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.
