Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
The BabyLM Challenge included an evaluation specifically designed to distinguish two kinds of generalization:
- Surface generalization: based on sentence length, orthography, whether the sentence contains a particular word — patterns a model could use without knowing grammar
- Linguistic generalization: based on actual grammatical structure — irregular past-tense forms, control constructions, embedded clause structure
Models were fine-tuned on an ambiguous training set whose labels were consistent with either generalization, then evaluated on a test set where the two generalizations predict different labels, revealing which one the model had converged on.
The key insight: a model can produce correct outputs on typical evaluation tasks while relying on surface generalizations rather than structural ones. If test sets are not specifically designed to rule out surface heuristics, you cannot tell which kind of generalization the model is using.
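The ambiguous-train, disambiguating-test design can be sketched with a toy example. Everything here is an illustrative assumption and not the BabyLM evaluation itself: the surface cue (the sentence contains "the"), the linguistic cue (the sentence contains an irregular past-tense verb from a small hypothetical lexicon), the sentences, and the two rule-based "learners" that stand in for fine-tuned models.

```python
# Hypothetical toy lexicon standing in for "irregular past-tense forms".
IRREGULAR_PAST = {"went", "ate", "saw", "took"}


def surface_feature(sentence: str) -> bool:
    """Surface cue (assumed for illustration): the sentence contains 'the'."""
    return "the" in sentence.lower().split()


def linguistic_feature(sentence: str) -> bool:
    """Linguistic cue (assumed): the sentence contains an irregular past-tense verb."""
    return any(word in IRREGULAR_PAST for word in sentence.lower().split())


def is_ambiguous(sentence: str) -> bool:
    """True when both cues agree, so a label cannot tell them apart."""
    return surface_feature(sentence) == linguistic_feature(sentence)


def accuracy(predict, sentences) -> float:
    """Accuracy against labels defined by the linguistic cue."""
    hits = sum(predict(s) == linguistic_feature(s) for s in sentences)
    return hits / len(sentences)


# Training items: both cues always agree, so either rule fits the labels.
TRAIN = ["the dog went home", "the cat ate fish",
         "a dog walks home", "a cat eats fish"]

# Test items: the cues always disagree, so each prediction reveals the rule used.
TEST = ["a dog went home", "the cat eats fish",
        "a bird saw me", "the bird sees me"]

assert all(is_ambiguous(s) for s in TRAIN)
assert not any(is_ambiguous(s) for s in TEST)

# Two stand-in "models": one that adopted the surface rule, one the linguistic rule.
for name, learner in [("surface", surface_feature), ("linguistic", linguistic_feature)]:
    print(name, accuracy(learner, TRAIN), accuracy(learner, TEST))
# prints:
# surface 1.0 0.0
# linguistic 1.0 1.0
```

Both stand-in learners are indistinguishable on the training set; only the items where the cues disagree separate them. A benchmark without such items would report the same score for both.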
This has wide implications for how we evaluate LLMs. When a model answers a grammaticality judgment task correctly, we tend to assume it has learned the relevant grammar. But it may have learned that short sentences with common words tend to be grammatical, that sentences with complex embeddings tend to be flagged as ungrammatical, or some other surface regularity that happens to correlate with the training labels.
Instruction tuning provides a striking parallel. The note "Does instruction tuning teach task understanding or output format?" shows that instruction-tuned (IT) models achieve comparable accuracy even when instructions are replaced with simplified or deliberately wrong ("delusive") instructions. Models learn the output format distribution (what kind of response is expected) rather than the task semantics the instructions describe. The "instruction-following" that benchmarks measure is largely format compliance, which correlates with task understanding but doesn't require it, just as syntactic benchmark performance correlates with grammatical knowledge but doesn't require it.
The distinction matters for robustness: surface generalizations fail on unusual structures. Linguistic generalizations are rule-governed and extend systematically to novel forms. If deployment involves unusual syntactic structures, a model relying on surface heuristics will fail — and the failure won't be predictable from standard benchmark performance.
A behavioral counterpart exists in moral reasoning, described in "Do LLMs generalize moral reasoning by meaning or surface form?". Minimal wording changes that reverse the moral meaning of a scenario (e.g., "wrongfully convicted" → "rightfully convicted") leave LLM moral ratings nearly unchanged (r=.99) while human ratings shift substantially (r=.54). This extends the surface-generalization finding from grammatical structure into behavioral/moral reasoning: the same failure mode operating at a higher cognitive level. Humans track the semantic reversal; LLMs track the token distribution.
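To make the correlation comparison concrete, here is a minimal sketch, assuming that r is the Pearson correlation between ratings of the original scenarios and ratings of their meaning-reversed rewordings. The rating values are invented purely for illustration and are not data from the study; the rater names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / (norm_x * norm_y)


# Invented ratings for five scenarios in their original wording.
original = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical rater that barely reacts to the meaning reversal.
insensitive_reworded = [1.1, 2.0, 3.2, 3.9, 5.0]

# Hypothetical rater whose judgments move substantially after the reversal.
sensitive_reworded = [4.0, 1.0, 5.0, 2.0, 3.0]

r_insensitive = pearson_r(original, insensitive_reworded)
r_sensitive = pearson_r(original, sensitive_reworded)
print(f"insensitive rater: r={r_insensitive:.2f}")
print(f"sensitive rater:   r={r_sensitive:.2f}")
```

Under this reading, a correlation near 1 means the reworded ratings are almost a copy of the original ones, so the reversal went unregistered; a lower correlation means the ratings were reordered in response to the changed meaning.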
Inquiring lines that use this note as a source (19)
This note is a source for these synthesized inquiries.
- How should benchmarks test whether models fit algorithms or patterns?
- Do language models learn surface patterns instead of underlying linguistic principles?
- Why do autoregressive models fail at controlling syntactic structure and semantic content?
- Do language models learn surface patterns that appear generalizable but actually fail under shift?
- What happens when formal languages satisfy hierarchy but fail learnability constraints?
- Does approaching human performance mean learning the same grammatical rules?
- Do LLMs rely on surface heuristics instead of learning recursive grammar rules?
- Do language models actually learn linguistic structure or just surface statistics?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- What language capabilities does fluency on standard benchmarks actually measure?
- Do language models encode deep syntactic structure or only surface-level patterns?
- What distinguishes surface generalizations from true linguistic generalizations?
- Can benchmark performance distinguish surface from structural linguistic knowledge?
- Why do surface generalizations fail on unusual syntactic structures?
- Can formal language pretraining address surface generalization without learning true linguistic structure?
- Do LLMs learn surface patterns instead of genuine linguistic structure?
- Why do language models reproduce human EPA structure despite different architecture?
- What substrate do supervised models lack that makes them weaker on low-resource languages?
- Can surface-level correctness hide failures in structural learning by LLMs?
Related concepts in this collection (6)
- Can language models learn grammar from child-scale data?
  If models trained on ~100 million words, roughly what children experience, can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?
  The qualification: approaching performance doesn't mean using the same underlying rules.
- Does LLM grammatical performance decline with structural complexity?
  This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.
  The practical consequence: complex structures break surface heuristics.
- Do hedging markers actually signal careful thinking in AI?
  Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
  The inference-time parallel: surface markers (hedging, explicit connectives) are unreliable proxies for underlying competence, just as surface learning heuristics are unreliable proxies for grammatical rules.
- Why do language models fail at communicative optimization?
  LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?
  The cross-linguistic taxonomy: "Do LLMs Resemble Humans" maps exactly which regularities transfer (sound symbolism, structural priming) vs. fail (word economy, syntactic ambiguity avoidance); the surface/structural distinction runs through all of them.
- Do LLMs generalize moral reasoning by meaning or surface form?
  When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.
  Behavioral evidence in the moral domain: the same surface-over-structure failure in moral judgment.
- Can language models solve ToM benchmarks without real reasoning?
  Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
  ToM benchmarks are another domain where correct outputs do not prove structural learning: SFT matches RL on ToM without reasoning training, suggesting models exploit distributional patterns in benchmark structure rather than performing genuine mental state inference.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Large Linguistic Models: Investigating LLMs' metalinguistic abilities
- Using Computational Models to Test Syntactic Learnability
- Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
- Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
- Faith and Fate: Limits of Transformers on Compositionality
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Chain-of-thought Reasoning Is A Policy Improvement Operator
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Original note title
lms may learn surface generalizations rather than linguistic generalizations despite correct outputs