INQUIRING LINE

Can speech embeddings carry articulatory structure that text cannot?

This explores whether self-supervised speech embeddings capture the physical machinery of how sounds are produced — the vocal tract's articulatory process — in a way that text, as a symbolic abstraction, structurally cannot.


Do speech embeddings carry something text loses, namely the physics of how sound is made? The corpus gives a fairly direct yes. Self-supervised speech models appear to learn the causal articulatory process behind acoustics, the language-agnostic mechanics of how the vocal tract shapes air into sound, rather than memorizing language-specific phonetic categories (see "Do speech models learn language-specific sounds or universal physics?"). The tell is generalization: this articulatory grounding predicts downstream performance and multilingual transfer better than phonetic probing does, which is what you'd expect if the model has latched onto the generative physics rather than surface labels.
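One way to picture what "articulatory probing" means is a linear probe: fit a simple map from frame-level embeddings to articulator positions and score it on held-out frames. The sketch below is only an illustration under stated assumptions. The data is synthetic (random vectors standing in for speech embeddings, and a random linear map standing in for vocal-tract trajectories), and the names `fit_ridge_probe` and `probe_correlation` are hypothetical, not taken from any cited paper or library.

```python
import numpy as np

def fit_ridge_probe(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def probe_correlation(X, Y, W):
    """Pearson correlation between predicted and true values, per target channel."""
    pred = X @ W
    pred_c = pred - pred.mean(axis=0)
    true_c = Y - Y.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
    return num / den

# ASSUMPTION: synthetic stand-ins, not real speech embeddings or articulatory recordings.
rng = np.random.default_rng(0)
n_frames, emb_dim, n_articulators = 2000, 32, 6
X = rng.standard_normal((n_frames, emb_dim))             # "embeddings"
true_map = rng.standard_normal((emb_dim, n_articulators))
Y = X @ true_map + 0.1 * rng.standard_normal((n_frames, n_articulators))  # "articulator positions"

split = 1500                                             # train on the first 1500 frames
W = fit_ridge_probe(X[:split], Y[:split])
scores = probe_correlation(X[split:], Y[split:], W)      # one score per articulator channel
mean_score = float(scores.mean())
print(mean_score)
```

With real data, the same held-out score computed per model is the kind of quantity one could compare against a phonetic-classification probe when asking which better tracks downstream performance.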

The reason text can't do this is the more interesting half. Text is a lossy human abstraction: it strips the physics, geometry, and causality present in the original signal, leaving language models to shuffle symbols without contact with the dynamics that produced them (see "Are text-only language models fundamentally limited by abstraction?"). Articulation is exactly the kind of source dynamic that gets compressed away when speech becomes a written string. A related argument sharpens the point: meaning (and, by extension, grounded structure) requires a relation between a form and what generated it, not just form-to-form prediction, and text training has no access to that relation (see "Can language models learn meaning from text patterns alone?"). Speech embeddings, trained on the acoustic signal itself, sit one step closer to the source.

The lateral surprise is that embeddings in general turn out to be richer than their training objective suggests. Static text embeddings already encode psycholinguistic dimensions you wouldn't expect from pure co-occurrence, including iconicity, the degree to which a word's sound resembles its meaning (see "Do transformer static embeddings actually encode semantic meaning?"). Iconicity is a faint echo of articulatory structure leaking back into text. Language models also spontaneously build structured, near-symbolic geometry from their training input, encoding syntactic type and direction in polar coordinates within their activations (see "How do language models encode syntactic relations geometrically?"). If text models manufacture that much latent structure from impoverished input, speech models fed the actual acoustics have far more articulatory structure available to organize.
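To see why a polar encoding can carry more than distance alone, consider a toy two-dimensional case. This is not the Polar Probe's actual procedure. It is a minimal sketch with made-up vectors and a hypothetical helper `polar_features`, showing only that two relations can sit at the same distance from a head word yet point in different directions.

```python
import numpy as np

def polar_features(head, dependent):
    """Return (distance, angle in radians) of the offset from head to dependent, in 2D."""
    diff = dependent - head
    distance = float(np.linalg.norm(diff))
    angle = float(np.arctan2(diff[1], diff[0]))
    return distance, angle

# ASSUMPTION: hand-picked 2D vectors, not activations from any real model.
head = np.array([0.0, 0.0])
dep_a = np.array([1.0, 0.0])   # stands in for one relation type
dep_b = np.array([0.0, 1.0])   # stands in for a different relation type

d_a, ang_a = polar_features(head, dep_a)
d_b, ang_b = polar_features(head, dep_b)
print(d_a, ang_a)
print(d_b, ang_b)
```

A distance-only readout cannot separate `dep_a` from `dep_b` here, while the angle can, which is the intuition behind reading both coordinates.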

The boundary isn't absolute, though. Language models can perform genuine phonological analysis through step-by-step reasoning, constructing valid generalizations about sound patterns rather than just parroting them (see "Can language models actually analyze language structure?"). So text models can reason *about* articulation symbolically even if they can't *embed* it. The distinction worth carrying away: speech embeddings carry articulatory structure implicitly, as learned physics; text can only ever describe it explicitly, as inferred rules. These are two different ways of knowing the same vocal tract: one from the inside of the signal, one from the outside.


Sources (6 notes)

Do speech models learn language-specific sounds or universal physics?

Self-supervised speech models learn the language-agnostic physics of how the vocal tract produces acoustics, not language-specific phonetic categories. This explains their multilingual transfer and predicts their downstream task performance better than phonetic probing.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a speech & language AI researcher. The question: **Do speech embeddings carry articulatory structure that text cannot?** Treat this as still-open, not settled.

What a curated library found — and when (findings span 2023–2026; treat as dated claims, not current truth):
• Self-supervised speech models learn causal articulatory kinematics (the physics of vocal-tract motion), not just phonetic categories; this predicts multilingual transfer better than phonetic probing (~2023).
• Text is a lossy abstraction that strips physics and causality; LLMs shuffle symbols without contact with the generative dynamics that produced speech (~2023–2024).
• Embeddings encode richer structure than their training objectives suggest: static text embeddings already carry psycholinguistic dimensions like iconicity (word sound resembling meaning) (~2025).
• LLMs can perform step-by-step metalinguistic reasoning about phonology, constructing valid generalizations symbolically even without acoustic grounding (~2023).
• Language models spontaneously build near-symbolic geometry in activations—syntactic type and direction encoded in polar coordinates (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2310.10788 (2023-10): Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
• arXiv:2305.00948 (2023-05): Large Linguistic Models: Investigating LLMs' metalinguistic abilities
• arXiv:2508.12863 (2025-08): Word Meanings in Transformer Language Models
• arXiv:2412.05571 (2024-12): A polar coordinate system represents syntax in large language models

Your task:
(1) **RE-TEST THE PHYSICS CLAIM.** Has multimodal pretraining (video, audio + text joint embeddings) since 2025 let text-dominant models acquire articulatory structure without raw acoustics? Have new speech SSL architectures pushed further into causal modeling? Does the constraint still hold, or has the gap narrowed?
(2) **Surface work that contradicts the text-as-lossy thesis.** Find papers (last 6 months) showing text models recovering or inferring articulatory constraints from symbol-level training alone, or showing speech embeddings *don't* outperform text on downstream tasks where physics should matter.
(3) **Propose two frontier questions assuming the regime may have shifted:** (a) Can joint text–speech embeddings recover articulatory structure *only when speech is present*, or do fused models learn it even from text-only subsets? (b) If language models do spontaneously build symbolic geometry from impoverished text, what prevents them from building *dynamic* articulatory models from phonological notation alone?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
