Why does transcription destroy prosodic information in speech processing?
This explores why turning speech into text throws away the acoustic layer — rhythm, pitch, emotion, timing — and what the corpus says is actually lost in that conversion.
The clearest answer in the collection is structural: text is a lossy compression of speech, and prosody is exactly the part that does not survive the squeeze. Text simply has no slot for rhythm, pitch, or timing. When LLaMA-Omni skips transcription and generates speech responses directly from speech input, it reaches 226-millisecond latency, which the collection attributes to speech embeddings carrying acoustic information that text representations discard (see "Can skipping transcription make voice assistants faster?"). Transcription is not a neutral translation. It is a bottleneck that keeps the words and drops everything about how they were said.
What exactly is in that discarded layer becomes clearer from work on what speech models actually learn. Self-supervised speech models do not pick up language-specific phonetic categories; they infer the causal, articulatory physics of how a vocal tract produces sound in the first place (see "Do speech models learn language-specific sounds or universal physics?"). Prosody lives in that physics: the continuous gestures of pitch and timing that generate the acoustic signal. Transcription collapses this continuous, physical process into a discrete sequence of symbols, so the very thing the acoustic signal was encoding gets quantized away.
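The collapse from a continuous signal to discrete symbols can be illustrated with a toy example. The sketch below is not a real speech pipeline: the `Segment` type, the `transcribe` and `pitch_slope` helpers, and every pitch and duration value are hypothetical, chosen only to show two deliveries of the same words becoming indistinguishable once reduced to text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    word: str          # lexical content
    f0_hz: float       # mean pitch over the word (hypothetical value)
    duration_s: float  # how long the word was held (hypothetical value)


def transcribe(utterance):
    """Toy transcription: keep the word symbols, drop every acoustic field."""
    return [seg.word for seg in utterance]


def pitch_slope(utterance):
    """Crude prosodic feature: pitch change from first to last word, in Hz."""
    return utterance[-1].f0_hz - utterance[0].f0_hz


# Same words, different delivery: a falling statement and a rising question.
statement = [
    Segment("you", 180.0, 0.15),
    Segment("are", 170.0, 0.12),
    Segment("coming", 130.0, 0.40),
]
question = [
    Segment("you", 170.0, 0.15),
    Segment("are", 180.0, 0.12),
    Segment("coming", 240.0, 0.55),
]

# The transcripts are identical, so the text cannot tell the two apart.
print(transcribe(statement) == transcribe(question))  # True

# The acoustic layer still separates them: falling vs. rising pitch.
print(pitch_slope(statement))  # -50.0
print(pitch_slope(question))   # 70.0
```

Any model that only ever sees the output of `transcribe` has no way to recover the slope, which is the sense in which the prosodic information is quantized away rather than merely hidden.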
There's a useful lateral angle here too: prosody isn't one undifferentiated blob. Work on gesture generation shows speech can be split into a high-level semantic channel (the meaning) and a low-level expressive channel (motion, emotion, style), and these can be disentangled across a model's layers (see "Can speech features be separated into semantic and stylistic components?"). Transcription, in effect, keeps only the semantic channel and severs the expressive one, which is why emotion-guided control needs the acoustic features that text never had.
The deeper framing is that this is the same compression tradeoff that shows up everywhere in language modeling. Modeling text well is mathematically equivalent to compressing it, and a compressor works by spending as few bits as possible on whatever its model treats as redundant (see "Can text-trained models compress images better than specialized tools?"). Text is a writing system optimized to preserve lexical meaning, not vocal performance, so prosody counts as 'redundant' to it and gets dropped at the door. Transcription doesn't destroy prosody by accident; it destroys it by design, because text was never built to hold it.
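The link between modeling and compression can be made concrete through the ideal code length: a symbol to which a model assigns probability p costs about -log2(p) bits under an arithmetic coder, so a better predictive model yields a shorter encoding. The following is a minimal sketch, assuming a character-level unigram model fitted on a toy string; the string, the model, and the helper name `ideal_code_length_bits` are illustrative choices, not taken from the cited work.

```python
import math
from collections import Counter


def ideal_code_length_bits(text, probs):
    """Sum of -log2 p(symbol): the ideal coded size of text under a model."""
    return sum(-math.log2(probs[ch]) for ch in text)


text = "abracadabra"
counts = Counter(text)

# Model 1: every distinct character is equally likely (no knowledge of the text).
uniform = {ch: 1 / len(counts) for ch in counts}

# Model 2: character frequencies. It is fitted on the text itself, so the
# cost of transmitting the model is ignored here for simplicity.
unigram = {ch: n / len(text) for ch, n in counts.items()}

bits_uniform = ideal_code_length_bits(text, uniform)
bits_unigram = ideal_code_length_bits(text, unigram)

# The better predictor needs fewer bits for the same text.
print(f"uniform: {bits_uniform:.2f} bits, unigram: {bits_unigram:.2f} bits")
print(bits_unigram < bits_uniform)  # True
```

Note that this coding step is lossless over whatever symbols it is given. In the transcription case the loss happens earlier, when the symbol inventory is fixed as characters or words and pitch and timing are never given a symbol to occupy.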
The payoff worth noticing: the same property that makes transcription destructive is what makes skipping it powerful. Once you keep the acoustic embedding instead of collapsing to text, you not only preserve prosody; you can also start generating a response before the full input even arrives, because the acoustic stream carries predictive cues that text only reveals after the sentence is complete (see "Can skipping transcription make voice assistants faster?").
Sources (4 notes)
LLaMA-Omni generates speech responses directly from speech input without transcribing to text first, achieving 226ms latency. This works because speech embeddings preserve acoustic information that text loses, enabling generation before full input is received.
Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.
DeepGesture's diffusion model splits speech into high-level semantic features and low-level motion features across encoder layers, enabling emotion-guided control. This disentanglement produces gestures that are both contextually appropriate and emotionally expressive, and generalizes to out-of-distribution synthetic voices.
Chinchilla models trained exclusively on text achieve better compression rates on images and audio than PNG and FLAC, respectively, by using their context window to adapt as task-specific compressors. This demonstrates that generalization operates through compression, not specialization.