SYNTHESIS NOTE

Can models pass tests while missing the actual grammar?

Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.

Synthesis note · 2026-02-21 · sourced from Discourses

The BabyLM Challenge included an evaluation specifically designed to distinguish two kinds of generalization:

Surface generalization: based on sentence length, orthography, whether the sentence contains a particular word — patterns a model could use without knowing grammar
Linguistic generalization: based on actual grammatical structure — irregular past-tense forms, control constructions, embedded clause structure

Models were fine-tuned on an ambiguous training set whose labels were consistent with either generalization, then evaluated on a test set in which the two generalizations made different predictions, revealing which one each model had converged on.
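The ambiguous-train, disambiguating-test design can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the BabyLM evaluation itself: the sentences, the tiny irregular-verb lexicon, and the two hand-written rules (`has_the`, `has_irregular_past`) are all hypothetical stand-ins for a fine-tuned model's learned behavior.

```python
# Toy illustration only: sentences, features, and rules are hypothetical.
IRREGULAR_PAST = {"ran", "sang"}  # tiny stand-in lexicon of irregular past forms


def has_the(sentence: str) -> bool:
    """Surface feature: the sentence contains the word 'the'."""
    return "the" in sentence.split()


def has_irregular_past(sentence: str) -> bool:
    """Linguistic feature: the sentence contains an irregular past-tense verb."""
    return any(token in IRREGULAR_PAST for token in sentence.split())


# Ambiguous training set: both features always agree, so the labels
# cannot tell the two generalizations apart.
AMBIGUOUS_TRAIN = [
    "the dog ran home",
    "the bird sang loudly",
    "a dog walked home",
    "a bird chirped loudly",
]

# Disambiguating test set: the two features always disagree.
DISAMBIGUATING_TEST = [
    "a dog ran home",
    "the dog walked home",
    "a bird sang loudly",
    "the bird chirped loudly",
]


def agreement(predict, sentences, feature) -> float:
    """Fraction of sentences where predict(sentence) equals feature(sentence)."""
    return sum(predict(s) == feature(s) for s in sentences) / len(sentences)


for name, rule in [("surface rule", has_the), ("linguistic rule", has_irregular_past)]:
    train = agreement(rule, AMBIGUOUS_TRAIN, has_irregular_past)
    test = agreement(rule, DISAMBIGUATING_TEST, has_irregular_past)
    print(f"{name}: train agreement {train:.2f}, linguistic agreement on test {test:.2f}")
```

Both rules fit the ambiguous training labels perfectly; only the disambiguating set separates them, which is the point of the design.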

The key insight: a model can produce correct outputs on typical evaluation tasks while relying on surface generalizations rather than structural ones. If test sets are not specifically designed to rule out surface heuristics, you cannot tell which kind of generalization the model is using.

This has wide implications for how we evaluate LLMs. When a model answers a grammaticality judgment task correctly, we tend to assume it has learned the relevant grammar. But it may have learned that short sentences with common words tend to be grammatical, that sentences with complex embeddings tend to be flagged as ungrammatical, or some other surface regularity that happens to correlate with the training labels.

Instruction tuning provides a striking parallel. The note "Does instruction tuning teach task understanding or output format?" shows that instruction-tuned (IT) models achieve comparable accuracy even when instructions are replaced with simplified or deliberately wrong ("delusive") instructions. Models learn the output format distribution (what kind of response is expected) rather than the task semantics the instructions describe. The "instruction-following" that benchmarks measure is largely format compliance: it correlates with task understanding but doesn't require it, just as syntactic benchmark performance correlates with grammatical knowledge but doesn't require it.

The distinction matters for robustness: surface generalizations fail on unusual structures. Linguistic generalizations are rule-governed and extend systematically to novel forms. If deployment involves unusual syntactic structures, a model relying on surface heuristics will fail — and the failure won't be predictable from standard benchmark performance.
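One way to see why aggregate benchmark scores can hide this kind of failure is to break accuracy down by structure type. The sketch below is illustrative only: the `typical` and `unusual` strata and the per-item results are made-up numbers, not measurements from any model or benchmark.

```python
from collections import defaultdict


def accuracy_by_stratum(records):
    """Map each stratum to its accuracy, given (stratum, correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [hits, count]
    for stratum, correct in records:
        totals[stratum][0] += int(correct)
        totals[stratum][1] += 1
    return {stratum: hits / count for stratum, (hits, count) in totals.items()}


# Hypothetical results: a benchmark dominated by typical structures.
records = (
    [("typical", True)] * 94
    + [("typical", False)] * 1
    + [("unusual", True)] * 1
    + [("unusual", False)] * 4
)

overall = sum(correct for _, correct in records) / len(records)
per_stratum = accuracy_by_stratum(records)

print(f"overall accuracy: {overall:.2f}")
for stratum, acc in per_stratum.items():
    print(f"{stratum}: {acc:.2f}")
```

In this made-up case the headline number looks strong because unusual structures are rare in the test set, while accuracy on those structures alone is poor.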

A behavioral counterpart exists in moral reasoning, covered in the note "Do LLMs generalize moral reasoning by meaning or surface form?". Minimal wording changes that reverse the moral meaning of a scenario (e.g., "wrongfully convicted" → "rightfully convicted") leave LLM moral ratings nearly unchanged (r=.99) while human ratings shift substantially (r=.54). This extends the surface-generalization finding from grammatical structure into moral reasoning: the same failure mode operating at a higher cognitive level. Humans track the semantic reversal; LLMs track the token distribution.
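A rough sketch of how such a sensitivity comparison might be computed follows. It assumes the reported r values are Pearson correlations between ratings of the original scenarios and ratings of their meaning-reversed rewordings, which the note does not state explicitly. All ratings below are invented for illustration and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical moral ratings for four scenarios (invented numbers).
original = [1.0, 2.0, 4.0, 5.0]

# A rater that ignores the reversal gives almost the same ratings again.
insensitive_reworded = [1.1, 2.0, 3.9, 5.0]

# A rater that tracks the reversal changes its ratings substantially.
sensitive_reworded = [4.5, 2.5, 4.0, 1.5]

r_insensitive = pearson(original, insensitive_reworded)
r_sensitive = pearson(original, sensitive_reworded)

print(f"insensitive rater: r = {r_insensitive:.2f}")
print(f"sensitive rater:   r = {r_sensitive:.2f}")
```

Under this reading, a correlation near 1 between original and reworded ratings signals that the rewording had little effect on the rater, so a high r is evidence of insensitivity to the meaning change rather than of agreement with humans.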

Inquiring lines that read this note (19)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do benchmark improvements fail to reflect actual reasoning quality?

Do language models learn genuine linguistic structure or just surface patterns?

What structural advantages do diffusion language models offer over autoregressive methods?

Why do autoregressive models fail at controlling syntactic structure and semantic content?

Do language models develop causal world models or rely on statistical patterns?

Why do language models reproduce human EPA structure despite different architecture?

Do language models understand semantics or rely on pattern matching?

What substrate do supervised models lack that makes them weaker on low-resource languages?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

Can surface-level correctness hide failures in structural learning by LLMs?

Related concepts in this collection (6)

Can language models learn grammar from child-scale data? If models trained on ~100 million words—roughly what children experience—can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?
the qualification: approaching performance doesn't mean using the same underlying rules
Does LLM grammatical performance decline with structural complexity? This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.
the practical consequence: complex structures break surface heuristics
Do hedging markers actually signal careful thinking in AI? Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
the inference-time parallel: surface markers (hedging, explicit connectives) are unreliable proxies for underlying competence, just as surface learning heuristics are unreliable proxies for grammatical rules
Why do language models fail at communicative optimization? LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?
the cross-linguistic taxonomy: "Do LLMs Resemble Humans" maps exactly which regularities transfer (sound symbolism, structural priming) vs. fail (word economy, syntactic ambiguity avoidance) — the surface/structural distinction runs through all of them
Do LLMs generalize moral reasoning by meaning or surface form? When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.
behavioral evidence in moral domain: same surface-over-structure failure in moral judgment
Can language models solve ToM benchmarks without real reasoning? Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
ToM benchmarks are another domain where correct outputs do not prove structural learning: SFT matches RL on ToM without reasoning training, suggesting models exploit distributional patterns in benchmark structure rather than performing genuine mental state inference