SYNTHESIS NOTE

Can skipping transcription make voice assistants faster?

Voice assistants traditionally convert speech to text before responding. Does eliminating that middle step reduce latency enough to matter for real-time conversation?

Synthesis note · 2026-05-03 · sourced from Speech Voice

Conventional voice assistants run a three-stage pipeline: automatic speech recognition (ASR) converts speech to text, an LLM generates a text response, and text-to-speech (TTS) renders that response as audio. Each stage adds latency and propagates errors, and the total response time is dominated by stage-by-stage processing rather than by the LLM's reasoning. LLaMA-Omni eliminates the transcription step entirely. It integrates a pretrained speech encoder, a speech adaptor, an LLM (built on Llama-3.1-8B-Instruct), and a streaming speech decoder so that the system generates text and speech responses directly from speech instructions, achieving a response latency as low as 226 milliseconds.
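To make the latency argument concrete, here is a minimal Python sketch of time-to-first-audio under two regimes: a cascade in which each stage waits for the previous one to finish, and a direct path in which each stage starts as soon as its upstream emits a first unit. The Stage class, both functions, and every millisecond figure are illustrative assumptions, not measurements from LLaMA-Omni or any real system; only the structure of the comparison reflects the description above.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    first_output_ms: float  # delay until the stage emits its first unit of output
    full_output_ms: float   # delay until the stage has emitted all of its output


def cascaded_first_audio_ms(stages):
    """Every upstream stage must finish completely before the next one starts."""
    *upstream, last = stages
    return sum(s.full_output_ms for s in upstream) + last.first_output_ms


def streamed_first_audio_ms(stages):
    """Each stage starts as soon as the previous one emits its first unit."""
    return sum(s.first_output_ms for s in stages)


# Hypothetical numbers, chosen only to show the shape of the comparison.
cascade = [
    Stage("asr", first_output_ms=120, full_output_ms=400),
    Stage("llm", first_output_ms=60, full_output_ms=900),
    Stage("tts", first_output_ms=150, full_output_ms=1200),
]
direct = [
    Stage("speech_encoder_and_adaptor", first_output_ms=80, full_output_ms=80),
    Stage("llm", first_output_ms=60, full_output_ms=900),
    Stage("streaming_speech_decoder", first_output_ms=90, full_output_ms=1200),
]

print(cascaded_first_audio_ms(cascade))  # 400 + 900 + 150
print(streamed_first_audio_ms(direct))   # 80 + 60 + 90
```

In this toy model the cascade pays for the full ASR and LLM passes before any audio can start, while the direct path pays only each stage's time to first output. How large the gap is in practice depends entirely on the real stage timings.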

The architectural lesson is that transcription is not a free intermediate representation. It is a serialization step that destroys prosodic information and forces full-utterance processing before generation can begin. By passing speech embeddings directly into the LLM via an adaptor, LLaMA-Omni preserves acoustic information the LLM might use, including the articulatory substrate identified in "Do speech models learn language-specific sounds or universal physics?", and lets generation begin as soon as enough of the input has been encoded, without waiting for a complete textual transcript.
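One way an adaptor can bridge a speech encoder and an LLM is to stack several consecutive encoder frames into one longer vector and project it to the LLM's embedding width. The sketch below shows only that shape transformation, in plain Python. This note does not describe the adaptor's internals, so the stacking factor, the dimensions, the fixed weights, and the function names are all assumptions made for illustration.

```python
def stack_frames(frames, k):
    """Concatenate every k consecutive frames into one longer vector.

    Trailing frames that do not fill a complete group are dropped.
    """
    usable = len(frames) - len(frames) % k
    stacked = []
    for start in range(0, usable, k):
        merged = []
        for frame in frames[start:start + k]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


def project(vectors, weights):
    """Apply a linear map; weights has one row per output dimension."""
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in weights]
        for vec in vectors
    ]


# Toy sizes (assumed): 12 encoder frames of width 2, stacked 5 at a time,
# then projected to a hypothetical LLM embedding width of 3.
encoder_frames = [[float(t), float(t) + 0.5] for t in range(12)]
stacked = stack_frames(encoder_frames, k=5)
weights = [[0.1] * 10 for _ in range(3)]  # placeholder for learned weights
llm_inputs = project(stacked, weights)

print(len(encoder_frames), "frames ->", len(llm_inputs), "LLM input vectors")
print("vector width:", len(llm_inputs[0]))
```

Stacking shortens the sequence the LLM has to process while keeping the acoustic features as continuous vectors instead of collapsing them to text. A real adaptor would learn its weights during training.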

The supporting piece is the InstructS2S-200K dataset of 200K speech instructions paired with speech responses, which makes the alignment trainable. Without paired speech-to-speech instruction data, end-to-end training has nothing to optimize. The general principle: when latency dominates user experience, the right intervention is to remove pipeline stages, not to optimize each stage in isolation.

Inquiring lines that read this note 8

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What articulatory information do speech signals carry that text cannot?

Why do benchmark improvements fail to reflect actual reasoning quality?

Why do speech benchmarks still measure transcription instead of comprehension?

Related concepts in this collection 4

Why do dialogue systems need probabilistic reasoning?
Explores whether deterministic flowchart-based dialogue systems can handle realistic speech recognition error rates of 15-30 percent, and what alternative approaches might be necessary.
contrasts: POMDPs absorb ASR noise probabilistically; LLaMA-Omni removes the ASR stage; the two represent compensate-for vs design-around responses to the same problem

Do speech models learn language-specific sounds or universal physics?
Exploring whether self-supervised speech models encode phonetic categories tied to specific languages or instead capture the underlying vocal-tract physics common to all humans. This matters for understanding why these models transfer across languages without retraining.
supports: gives a principled reason why bypassing transcription should help: the speech encoder carries articulatory structure that text loses

What speech tasks remain without standardized benchmarks?
Speech evaluation has strong benchmarks for transcription and translation, but broader comprehension and reasoning tasks over audio lack standardized measurement. This gap may constrain which capabilities researchers prioritize building.
extends: same Voxtral/Speech Voice cluster; transcription-centric benchmarks impede evaluation of exactly the speech-to-speech capabilities LLaMA-Omni demonstrates

Can models precompute answers before users ask questions?
Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
extends: both reduce user-perceived latency by relocating where compute happens; LLaMA-Omni removes a pipeline stage; sleep-time compute precomputes one in advance

Original note title

eliminating speech transcription enables 226 millisecond response latency — direct speech-to-speech generation collapses the cascade that ASR-LLM-TTS pipelines impose
