INQUIRING LINE

How do speech encoders learn articulatory physics without phonetic labels?

This explores how self-supervised speech models pick up the bodily mechanics of how a vocal tract makes sound — without ever being told which sound is which phoneme.


The most direct answer in the corpus is that these encoders learn less about phonetic categories than about the causal process underneath them. By predicting masked or future audio, self-supervised speech models end up inferring the language-agnostic mechanics of how a vocal tract (tongue, lips, vocal folds) produces acoustics, rather than memorizing language-specific sound inventories (see "Do speech models learn language-specific sounds or universal physics?"). The tell is multilingual transfer: a model that had merely catalogued English phonemes should not generalize to languages it never heard, whereas one that has internalized articulatory physics should, and these models do. Strikingly, how well a model encodes articulation predicts its downstream task performance better than phonetic probing does, which suggests the physics is the more fundamental thing the model is tracking.
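As a rough illustration of what "probing" means here, the sketch below fits a linear (ridge) probe from frame-level features to articulator trajectories and scores it on held-out frames. Everything in it is an assumption made for illustration: the features and trajectories are synthetic stand-ins rather than real encoder activations or articulatory recordings, and the names `fit_ridge_probe` and `probe_r2` are hypothetical. It shows only the mechanics of a probe, not the comparison reported in the source note.

```python
import numpy as np

def add_bias(x):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def fit_ridge_probe(features, targets, alpha=1.0):
    """Closed-form ridge regression; the bias is regularized too, for simplicity."""
    x = add_bias(features)
    gram = x.T @ x + alpha * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)

def probe_r2(features, targets, weights):
    """Pooled R^2 of the probe's predictions across all target dimensions."""
    pred = add_bias(features) @ weights
    ss_res = ((targets - pred) ** 2).sum()
    ss_tot = ((targets - targets.mean(axis=0)) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

# Synthetic stand-ins (assumption): 6 articulator channels drive 64-dim features.
rng = np.random.default_rng(0)
n_frames, n_artic, n_feat = 2000, 6, 64
artic = rng.normal(size=(n_frames, n_artic))
mixing = rng.normal(size=(n_artic, n_feat))
features = artic @ mixing + 0.1 * rng.normal(size=(n_frames, n_feat))

# Fit on the first 1600 frames, evaluate on the remaining 400.
split = 1600
w_real = fit_ridge_probe(features[:split], artic[:split])
r2_real = probe_r2(features[split:], artic[split:], w_real)

# Control: shuffled targets carry no recoverable signal, so held-out R^2 should collapse.
shuffled = artic[rng.permutation(n_frames)]
w_ctrl = fit_ridge_probe(features[:split], shuffled[:split])
r2_control = probe_r2(features[split:], shuffled[split:], w_ctrl)

print(f"held-out R^2, real targets:     {r2_real:.3f}")
print(f"held-out R^2, shuffled targets: {r2_control:.3f}")
```

The shuffled-target control matters because a probe with enough parameters can fit noise on the training frames; only the gap between the two held-out scores says anything about what the features encode.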

Why would physics emerge for free? Plausibly because it is the most compressive explanation of the data. Every acoustic frame in speech is the downstream consequence of a continuous physical gesture, so a model forced to predict audio efficiently is rewarded for discovering the small set of articulatory controls that generate the large, messy space of sounds. The corpus draws the same lesson in reverse for text: text-only models are limited because language has already stripped out the physics, geometry, and causality of the world before the model ever sees it (see "Are text-only language models fundamentally limited by abstraction?"). Speech is the opposite case: the signal still carries the physical dynamics, so a model trained on it can recover a generative process that text models are structurally denied.
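The compression argument above can be made concrete with a toy example. In the sketch below, three slowly varying "articulatory controls" generate 40-dimensional "acoustic frames" through a fixed random linear map, and a singular value decomposition shows that nearly all the variance sits in three directions. The linear synthesis map, the sinusoidal controls, and all sizes are assumptions chosen for simplicity; real vocal-tract acoustics are nonlinear, so this illustrates the idea rather than modelling speech.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_controls, n_bins = 5000, 3, 40

# Hypothetical controls: three smooth gestures evolving at different rates.
t = np.linspace(0.0, 50.0, n_frames)
controls = np.stack(
    [np.sin(0.7 * t), np.sin(1.3 * t + 0.5), np.cos(2.1 * t)], axis=1
)

# Assumed linear "synthesis" from controls to spectral bins, plus small noise.
synthesis = rng.normal(size=(n_controls, n_bins))
frames = controls @ synthesis + 0.01 * rng.normal(size=(n_frames, n_bins))

# Fraction of variance captured by each principal direction of the frames.
centered = frames - frames.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values**2 / (singular_values**2).sum()
top_k = float(explained[:n_controls].sum())

print(f"variance explained by the first {n_controls} directions: {top_k:.4f}")
```

A predictor that tracks those few directions can reconstruct every frame almost exactly, which is the sense in which the generating controls are the cheapest description of the signal.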

Indirect evidence that the recovered structure is tied to physical dynamics rather than surface acoustics comes from how speech encoders organize their layers. DeepGesture splits speech into high-level semantic features and low-level motion features at different encoder depths, and it stays robust on out-of-distribution synthetic voices (see "Can speech features be separated into semantic and stylistic components?"). The motion there is body gesture generated from speech rather than the vocal tract itself, so the parallel is suggestive rather than conclusive, but it is the kind of disentanglement you would expect if a network had separated 'what is being said' from 'how the body is moving to say it'. That the motion channel survives voice changes suggests the model latched onto generative dynamics, not surface timbre.

There is a broader pattern here: neural networks repeatedly recover structured, near-symbolic geometry from raw prediction objectives without being told the structure exists. Language models spontaneously encode syntactic type and direction in the distance and angle between embeddings (see "How do language models encode syntactic relations geometrically?"), and they learn meaning by compressing purely relational structure with no external referents at all, à la Saussure (see "Can language models learn meaning without engaging the world?"). Articulatory physics emerging without phonetic labels is the speech-domain version of the same phenomenon: the labels were never the point, and removing them lets the model find the more economical underlying axis. To go further, contrast "Do speech models learn language-specific sounds or universal physics?" with "Are text-only language models fundamentally limited by abstraction?"; the pair is the cleanest way to see why modality, not labeling, decides what a model can ground.
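To see why angle can carry information that distance alone cannot, consider the toy geometry below. It is not the Polar Probe itself: the three "relation types", their angles, and the nearest-centroid classifier are all hypothetical choices made for illustration. Each relation is a difference vector of roughly unit length, and relation 2 is relation 0 with its direction reversed.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_class = 200

# Hypothetical relations: same length, different angle (2 is 0 reversed).
angle_table = np.array([0.0, np.pi / 2, np.pi])
labels = np.repeat(np.arange(3), n_per_class)
theta = angle_table[labels] + 0.1 * rng.normal(size=labels.size)
radius = 1.0 + 0.05 * rng.normal(size=labels.size)
diffs = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)

def nearest_centroid_accuracy(x, y):
    """Assign each point to the closest class mean; report in-sample accuracy."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return float((classes[dists.argmin(axis=1)] == y).mean())

# Distance-only view: keep just the length of each difference vector.
lengths = np.linalg.norm(diffs, axis=1, keepdims=True)
acc_distance = nearest_centroid_accuracy(lengths, labels)

# Distance-plus-angle view: the full 2-D vector retains both.
acc_polar = nearest_centroid_accuracy(diffs, labels)

print(f"distance only:      {acc_distance:.2f}")
print(f"distance and angle: {acc_polar:.2f}")
```

Because all three relations share the same length distribution, the distance-only view cannot tell them apart, whereas the full vector separates both the relation type and its direction.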


Sources (5 notes)

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can speech features be separated into semantic and stylistic components?

DeepGesture's diffusion model splits speech into high-level semantic features and low-level motion features across encoder layers, enabling emotion-guided control. This disentanglement produces gestures that are both contextually appropriate and emotionally expressive, and generalizes to out-of-distribution synthetic voices.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
