Using Computational Models to Test Syntactic Learnability

We study the learnability of English filler–gap dependencies and the “island” constraints on them by assessing the generalizations made by autoregressive (incremental) language models that use deep learning to predict the next word given preceding context. Using factorial tests inspired by experimental psycholinguistics, we find that models acquire not only the basic contingency between fillers and gaps, but also the unboundedness and hierarchical constraints implicated in the dependency. We evaluate a model’s acquisition of island constraints by demonstrating that its expectation for a filler–gap contingency is attenuated within an island environment. Our results provide empirical evidence against the Argument from the Poverty of the Stimulus for this particular structure.
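The factorial logic can be illustrated with a small sketch. It assumes a 2x2 design crossing the presence of a filler with the presence of a gap, and it assumes that the model's expectation is summarized as a surprisal value (in bits) at a critical region. The function name `licensing_interaction` and all of the numbers are hypothetical and chosen only for illustration. They are not results from the paper, and the difference-of-differences formula is one common way to quantify such an interaction rather than a confirmed description of the authors' analysis.

```python
from typing import Dict, Tuple

# A condition is (has_filler, has_gap); values are surprisals in bits.
Condition = Tuple[bool, bool]


def licensing_interaction(surprisal: Dict[Condition, float]) -> float:
    """Difference of differences over a 2x2 (filler x gap) design.

    A large positive value means a filler raises surprisal when no gap
    follows and lowers it when a gap follows, i.e. the model expects
    fillers and gaps to co-occur.
    """
    filler_effect_without_gap = surprisal[(True, False)] - surprisal[(False, False)]
    filler_effect_with_gap = surprisal[(True, True)] - surprisal[(False, True)]
    return filler_effect_without_gap - filler_effect_with_gap


# Hypothetical surprisals for a non-island (control) construction.
control = {
    (False, False): 5.0,
    (True, False): 7.0,
    (False, True): 9.0,
    (True, True): 6.0,
}

# Hypothetical surprisals for the same items with the gap inside an island.
island = {
    (False, False): 5.0,
    (True, False): 5.5,
    (False, True): 9.0,
    (True, True): 8.5,
}

control_effect = licensing_interaction(control)
island_effect = licensing_interaction(island)

# Attenuation: the interaction is smaller inside the island.
print(control_effect, island_effect, island_effect < control_effect)
```

Under these made-up numbers, the interaction shrinks inside the island, which is the pattern the abstract describes as an attenuated expectation for the filler–gap contingency.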

Introduction. The English filler–gap dependency is the co-variation between a wh-word or phrase (a filler) and an empty syntactic position (a gap). It is special in that it can span a potentially unbounded number of nodes in a syntactic tree, yet it is subject to a subtle set of constraints known as island constraints (Ross, 1967). For example, in the grammatical sentence in (1-a), the dependency between the filler and the gap spans two sentential embeddings. However, a similar sentence, (1-b), is rendered ungrammatical when the gap site resides within a syntactic ‘island’, in this case a Complex Noun Phrase. A successful theory of the filler–gap dependency and its associated constraints must deal with two interrelated facts. First, despite some inter-language variability, the same set of structures arises as syntactic islands in language after language. Second, despite noisy and primarily positive evidence from caregivers, children within an individual language community tend to converge on the same set of islands.

Discussion / Conclusion. As mentioned in the introduction, we believe that a key contribution of this paper is methodological, and we hope that the methodology deployed here can be extended beyond the case of the filler–gap dependency. The approach taken in this paper involves assessing the capabilities of Artificial Neural Network models by testing them much as one would test a human subject in a psycholinguistic experiment. Constructing test suites that mimic online processing experiments in humans makes it possible to test any model that makes incremental predictions about language, even ones whose internal states are opaque, such as RNNs and Transformers. Furthermore, this method can be used to test learning outcomes over a wide array of syntactic structures. Our tests reveal that these weakly biased models acquire impressively sophisticated generalizations regarding the filler–gap dependency and island constraints from even a childhood’s quantity of linguistic input, though in some cases we find acquisition failures. It is our hope that this method gains traction among psycholinguists studying incremental models of processing, as well as syntacticians who are more concerned with grammatical representations.
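The point that any incremental model can be tested, however opaque its internals, can be sketched as a black-box interface. The sketch assumes that the only thing required of a model is a function returning the probability of the next word given the preceding words. The names `NextWordProb`, `surprisals`, and `uniform_model` are hypothetical, and the uniform model is a trivial stand-in for a trained RNN or Transformer.

```python
import math
from typing import Callable, List, Sequence

# Assumed interface: (preceding words, next word) -> probability in (0, 1].
NextWordProb = Callable[[Sequence[str], str], float]


def surprisals(model: NextWordProb, words: Sequence[str]) -> List[float]:
    """Per-word surprisal in bits: -log2 p(word | preceding words).

    Only the model's output probabilities are used, so its internal
    states never need to be inspected.
    """
    return [-math.log2(model(words[:i], word)) for i, word in enumerate(words)]


def uniform_model(context: Sequence[str], word: str) -> float:
    # Hypothetical stand-in: every word gets probability 1/4 in any context.
    return 0.25


sentence = ["who", "did", "you", "see"]
print(surprisals(uniform_model, sentence))
```

Swapping `uniform_model` for a wrapper around a real language model would leave `surprisals` unchanged, which is the sense in which the test suite is independent of model architecture.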

Lines of inquiry this paper opens

Research framings built by reading the notes related to this paper — the questions it feeds into.

Why do benchmark improvements fail to reflect actual reasoning quality?

Do language models learn genuine linguistic structure or just surface patterns?

What structural advantages do diffusion language models offer over autoregressive methods?

Why do autoregressive models fail at controlling syntactic structure and semantic content?

Do language models develop causal world models or rely on statistical patterns?

Do language models understand semantics or rely on pattern matching?

What substrate do supervised models lack that makes them weaker on low-resource languages?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

Can surface-level correctness hide failures in structural learning by LLMs?

How does latent reasoning compare to verbalized chain-of-thought?

Why does recursion on latent state drive generalization better than hierarchy?

Does model scaling alone produce compositional generalization without symbolic mechanisms?
