SYNTHESIS NOTE
Conversational AI and Personalization · Training, RL, and Test-Time Scaling

Can skipping transcription make voice assistants faster?

Voice assistants traditionally convert speech to text before responding. Does eliminating that middle step reduce latency enough to matter for real-time conversation?

Synthesis note · 2026-05-03 · sourced from Speech Voice

Conventional voice assistants run a three-stage pipeline: automatic speech recognition (ASR) converts speech to text, an LLM generates a text response, and text-to-speech (TTS) renders that response as audio. Each stage adds latency and passes its errors downstream, so total response time is dominated by stage-by-stage processing rather than by the LLM's reasoning. LLaMA-Omni eliminates the input transcription step entirely. It integrates a pretrained speech encoder, a speech adaptor, an LLM (built on Llama-3.1-8B-Instruct), and a streaming speech decoder, so the system generates text and speech responses directly from speech instructions, with a response latency as low as 226 milliseconds.
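The latency argument above can be made concrete with a toy model. The sketch below is illustrative only: the `Stage` type, both functions, and every number in `CASCADE` are hypothetical placeholders, not measurements of LLaMA-Omni or of any real ASR, LLM, or TTS system. It assumes a fully blocking cascade in which each stage waits for the previous one to finish, and compares that with a streaming arrangement in which each stage starts on the first chunk it receives.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    first_chunk_ms: float  # delay until the stage emits its first usable output
    full_ms: float         # delay until the stage has finished completely


def blocking_latency_ms(stages: Sequence[Stage]) -> float:
    """Time to audio when every stage must finish before the next one starts."""
    return sum(s.full_ms for s in stages)


def streaming_latency_ms(stages: Sequence[Stage]) -> float:
    """Time to first audio when each stage starts on its predecessor's first chunk."""
    return sum(s.first_chunk_ms for s in stages)


# Placeholder values chosen only to illustrate the arithmetic (assumption).
CASCADE = [
    Stage("asr", first_chunk_ms=100, full_ms=400),
    Stage("llm", first_chunk_ms=150, full_ms=900),
    Stage("tts", first_chunk_ms=80, full_ms=600),
]

if __name__ == "__main__":
    print("blocking :", blocking_latency_ms(CASCADE), "ms")
    print("streaming:", streaming_latency_ms(CASCADE), "ms")
```

Under these assumptions the blocking total grows with every stage's full duration, while the streaming total depends only on first-chunk delays. That is why removing a stage, or letting stages overlap, changes time-to-first-audio more than speeding up any one stage does.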

The architectural lesson is that transcription is not a free intermediate representation: it is a serialization step that destroys prosodic information and forces full-utterance processing before generation can begin. By passing speech embeddings directly into the LLM via an adaptor, LLaMA-Omni preserves acoustic information the LLM might use, including the articulatory substrate identified in "Do speech models learn language-specific sounds or universal physics?", and lets generation begin as soon as enough of the input has been encoded, without waiting for a complete textual transcript.
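This note does not describe the adaptor's internals. As a minimal sketch of what "passing speech embeddings into the LLM via an adaptor" can look like, the code below assumes one common design: stack every `k` consecutive encoder frames to shorten the sequence, then apply a linear projection into the LLM's embedding width. The function names, the choice of `k`, and the weight matrix are all hypothetical, and a real adaptor would use learned weights and a tensor library rather than Python lists.

```python
from typing import List

Vector = List[float]


def stack_frames(frames: List[Vector], k: int) -> List[Vector]:
    """Concatenate every k consecutive frames into one longer vector.

    Trailing frames that do not fill a complete group of k are dropped.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    usable = len(frames) - len(frames) % k
    return [
        [x for frame in frames[i:i + k] for x in frame]
        for i in range(0, usable, k)
    ]


def project(vectors: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Apply a linear map; `weights` holds one row per output dimension."""
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in weights]
        for vec in vectors
    ]


def adapt(frames: List[Vector], k: int, weights: List[Vector]) -> List[Vector]:
    """Hypothetical adaptor: downsample by frame stacking, then project."""
    return project(stack_frames(frames, k), weights)


if __name__ == "__main__":
    # Five 2-dimensional encoder frames (made-up values).
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    # Maps stacked 4-dimensional vectors to a 3-dimensional "LLM" space.
    weights = [[1, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
    print(adapt(frames, k=2, weights=weights))
```

The point of the sketch is that the LLM receives continuous vectors derived from the audio rather than discrete text tokens, so nothing forces the signal through a transcript before generation starts.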

What makes this alignment trainable is InstructS2S-200K, a dataset of 200K speech instructions paired with speech responses. Without paired speech-to-speech instruction data, end-to-end training has nothing to optimize. The general principle: when latency dominates user experience, the right intervention is to remove pipeline stages, not to optimize each stage in isolation.

Original note title

eliminating speech transcription enables 226 millisecond response latency — direct speech-to-speech generation collapses the cascade that ASR-LLM-TTS pipelines impose