INQUIRING LINE

Model Architecture and Internals · Conversational AI and Personalization · Training, RL, and Test-Time Scaling · cross-cluster

What information does transcription destroy that direct speech pathways preserve?

This explores what gets lost when speech is converted to text first — the acoustic, articulatory, and prosodic information that direct speech-to-speech systems keep but a transcript throws away.

The corpus is unusually direct about this. The clearest answer comes from work on skipping transcription entirely: LLaMA-Omni generates speech responses straight from speech input at 226ms latency, and it can do so because speech embeddings carry acoustic information that text does not encode (see: Can skipping transcription make voice assistants faster?). Text is a lossy compression of speech: it keeps the words and drops everything about *how* they were said, including pitch, timing, emphasis, emotional coloring, speaker identity, and the overlap and hesitation that signal meaning. A transcript of "sure, fine" cannot tell you whether it was warm or icy.
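The many-to-one nature of that compression can be shown with a toy sketch. Everything here is an illustrative assumption: the `Utterance` fields, the pitch and duration values, and the `transcribe` function are hypothetical stand-ins, not a real speech format or any system's API. The point is only that two different utterances can collapse to the same transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for a spoken utterance (hypothetical fields)."""
    words: tuple[str, ...]
    pitch_hz: tuple[float, ...]    # assumed: one mean pitch value per word
    duration_ms: tuple[int, ...]   # assumed: one duration per word


def transcribe(u: Utterance) -> str:
    """Keep only the words; pitch and timing have no slot in the output."""
    return " ".join(u.words)


# Same words, different delivery (values invented for illustration).
warm = Utterance(("sure", "fine"), (220.0, 240.0), (300, 350))
icy = Utterance(("sure", "fine"), (180.0, 150.0), (500, 700))

same_text = transcribe(warm) == transcribe(icy)  # the transcripts collide
same_speech = warm == icy                        # the utterances do not
```

Because `transcribe` is not invertible, nothing downstream of the transcript can recover which delivery produced it.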

What makes this more than a hunch is that researchers can now show those discarded dimensions are real and structured. Self-supervised speech models don't learn language-specific phonetic categories at all. They infer the causal articulatory physics of how a vocal tract produces sound, which is why they transfer across languages and why that physical structure predicts downstream performance better than phonetic probes do (see: Do speech models learn language-specific sounds or universal physics?). That whole layer of generative, body-level structure lives below the level a transcript can represent. Transcription doesn't just blur it; it has no slot for it.

The lost information is also separable and useful, not noise. Work on co-speech gesture shows speech can be decomposed across encoder layers into high-level semantic content and low-level motion and prosodic features. It is the latter, the part text drops, that lets a system generate emotionally expressive, contextually appropriate gestures, even for voices it never trained on (see: Can speech features be separated into semantic and stylistic components?). So the destroyed channel isn't decorative: it carries the affective and embodied signal that downstream behavior depends on.
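A rough sketch of what a split across encoder layers can look like, under loud assumptions: this is not DeepGesture's architecture, the `split_by_depth` helper and the `boundary` parameter are hypothetical, and mean pooling is a swapped-in simplification. It assumes only that an encoder exposes one feature vector per layer, with earlier layers closer to the signal and later layers closer to meaning.

```python
def mean_pool(layers):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(layers)
    return [sum(vals) / n for vals in zip(*layers)]


def split_by_depth(hidden_states, boundary):
    """Summarize layers below `boundary` as low-level features and the
    rest as high-level features (hypothetical split point)."""
    if not 0 < boundary < len(hidden_states):
        raise ValueError("boundary must leave at least one layer on each side")
    low = mean_pool(hidden_states[:boundary])
    high = mean_pool(hidden_states[boundary:])
    return low, high


# Four invented per-layer vectors, ordered from first to last layer.
hidden_states = [
    [1.0, 0.0],
    [3.0, 2.0],
    [0.0, 4.0],
    [2.0, 8.0],
]
low, high = split_by_depth(hidden_states, boundary=2)
```

A text-first pipeline only ever sees something like `high`; the `low` summary is the part that never reaches it.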

Here's the lateral turn worth sitting with: text being lossy is sometimes the *point*, not a bug. There's a flip side where readability itself is treated as overhead: LLMs can compress meaning into non-human-readable form at about 28% of the original length while keeping 99.5% semantic fidelity (see: Can language models communicate without human-readable text?). And text's compression pressure actively homogenizes: high-frequency written forms flatten distinct inputs as users rephrase toward what the model handles best (see: Does high-frequency text homogenize user input before generation?). Put those together and transcription looks like one instance of a more general pattern, in which every move toward text trades distinctiveness (acoustic individuality, prosody, the unusual phrasing) for the model's preferred average. Direct speech pathways are valued precisely because they refuse that trade.

Sources 5 notes

Can skipping transcription make voice assistants faster?

LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Can speech features be separated into semantic and stylistic components?

DeepGesture's diffusion model splits speech into high-level semantic features and low-level motion features across encoder layers, enabling emotion-guided control. This disentanglement produces gestures that are both contextually appropriate and emotionally expressive, and generalizes to out-of-distribution synthetic voices.

Can language models communicate without human-readable text?

Instruction-tuned LLMs zero-shot generate and decode highly compressed, non-human-readable text while preserving 99.5% semantic fidelity at 27.9% of original length. This capacity generalizes across model families, suggesting readability is human overhead rather than model necessity.

Does high-frequency text homogenize user input before generation?

Adam's Law shows LLMs flatten distinct prompts at comprehension time as users rephrase toward higher-frequency forms the model handles best. The same distributional property that creates accuracy on common tasks filters out distinctiveness on the input side.
