Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
Self-supervised learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained black boxes, but many recent studies have begun “probing” models such as HuBERT to correlate their internal representations with different aspects of speech. In this paper, we show that “inference of articulatory kinematics” is a fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction largely overlaps across models trained on data from different languages, with a preference for languages with a similar phonological system. Furthermore, we show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferable across speakers, even across genders, languages, and dialects, showing the generalizability of this property. Together, these results shed new light on the internals of SSL models that are critical to their superior performance, and open up new avenues toward language-agnostic universal models for speech engineering that are interpretable and grounded in speech science.
Introduction. Self-supervised learning (SSL) has revolutionized every field of machine learning by providing rich features of natural data without human-annotated labels. Likewise, speech SSL models have proven successful in various downstream speech tasks [1]. To understand this utility, the internal representations of speech SSL models have been scrutinized through probing analyses for known speech and linguistic features, such as low-level acoustics, phonetics, and lexical semantics [2, 3, 4, 5]. A comparative analysis by Cho et al. [5] demonstrates that the representations of state-of-the-art SSL models are highly correlated with articulatory kinematics, and that the correlation score can indicate the success of an SSL model in downstream tasks. This finding has been extended to develop a high-performance Acoustic-to-Articulatory inversion (AAI) model [6]. Here, we test an intriguing hypothesis: speech SSL models infer the causal articulatory processes that generate the speech acoustic signal.
Discussion / Conclusion. We demonstrate that recent speech SSL models can recover articulatory kinematics with a simple linear mapping, achieving high-performance inversion regardless of speaker, language, or dialect. This is further verified by finding affine transformations from one articulatory system to another. Our findings provide strong evidence that a canonical basis of articulatory phonology emerges naturally in self-supervised learning of speech. They also suggest another interesting hypothesis: articulatory kinematics are not only the physical interface of speech but also continuous embedding representations of phonetics. Because the SSL models are trained with a masked prediction objective without any external labels, the resulting representations are perceptual descriptions of how sounds are shaped in natural speech data, and they also show high correspondence to human auditory perception [24].
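To make the two linear operations discussed above concrete, the following is a minimal sketch, not the implementation used in this work. It fits (i) a linear probe from frame-level features to articulatory traces and (ii) an affine map from one articulatory space to another, and scores both by Pearson correlation on held-out frames. Everything in it is an assumption made for illustration: the data are synthetic random stand-ins rather than real SSL features or articulatory recordings, and the names fit_affine, apply_affine, mean_pearson, and demo, the array sizes, and the ridge value are all hypothetical.

```python
import numpy as np


def fit_affine(X, Y, ridge=0.0):
    """Least-squares affine map Y ~ X @ W + b. Returns (W, b)."""
    # Append a column of ones so the bias is learned jointly with the weights.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    # Optional ridge penalty on the weights only (the bias is not penalized).
    reg = ridge * np.eye(X_aug.shape[1])
    reg[-1, -1] = 0.0
    Wb = np.linalg.solve(X_aug.T @ X_aug + reg, X_aug.T @ Y)
    return Wb[:-1], Wb[-1]


def apply_affine(X, W, b):
    return X @ W + b


def mean_pearson(Y_true, Y_pred):
    """Pearson correlation per channel, averaged over channels."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return float(((yt * yp).sum(axis=0) / denom).mean())


def demo(seed=0):
    rng = np.random.default_rng(seed)
    n_frames, feat_dim, n_channels = 2000, 64, 12  # illustrative sizes only

    # Synthetic stand-ins (assumption): "features" and "speaker A" traces
    # that depend linearly on them, plus a little noise.
    feats = rng.standard_normal((n_frames, feat_dim))
    art_a = feats @ rng.standard_normal((feat_dim, n_channels))
    art_a += 0.1 * rng.standard_normal(art_a.shape)

    # "Speaker B" traces: an affine transform of speaker A, plus noise.
    art_b = art_a @ rng.standard_normal((n_channels, n_channels))
    art_b += rng.standard_normal(n_channels)
    art_b += 0.1 * rng.standard_normal(art_b.shape)

    train, test = slice(0, 1500), slice(1500, None)

    # (i) Linear probe: features -> articulatory traces.
    W, b = fit_affine(feats[train], art_a[train], ridge=1e-3)
    probe_r = mean_pearson(art_a[test], apply_affine(feats[test], W, b))

    # (ii) Affine transfer: speaker A space -> speaker B space.
    M, c = fit_affine(art_a[train], art_b[train])
    transfer_r = mean_pearson(art_b[test], apply_affine(art_a[test], M, c))
    return probe_r, transfer_r


if __name__ == "__main__":
    probe_r, transfer_r = demo()
    print(f"probe r = {probe_r:.3f}, transfer r = {transfer_r:.3f}")
```

Because the synthetic targets are linear in the inputs by construction, both correlations come out high here; the sketch shows only the mechanics of fitting and scoring such maps and says nothing about how well real SSL features predict real articulatory data.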