INQUIRING LINE

What happens when formal languages satisfy hierarchy but fail learnability constraints?

This line explores what the corpus says about training language models on languages that have the right nested, hierarchical structure but that models cannot actually internalize as rules: does hierarchy help, or do models just fake it with surface tricks?


This is really a question about the gap between structure that exists in the data and structure a model can learn. Formal languages can be perfectly hierarchical, but whether a model captures that hierarchy or just imitates it is a separate matter. The corpus splits cleanly into a hopeful finding and a sobering one. On the hopeful side, hierarchy in the training signal genuinely transfers: pre-pretraining 1B models on hierarchical formal languages reaches the same loss with 33% fewer natural-language tokens, and the attention heads shaped by those formal languages stay critical for real syntactic work later (see: Can formal language pretraining make language models more efficient?). So satisfying hierarchy is not cosmetic; it leaves a durable, mechanistic fingerprint.
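To make "hierarchical formal language" concrete, here is a minimal sketch of a sampler for a Dyck-style bracket language with a cap on nesting depth. The choice of a two-bracket Dyck language, the names `sample_dyck`, `nesting_depth`, and `is_balanced`, and every parameter value are assumptions made for illustration; the notes do not say which formal languages were actually used.

```python
import random

# Hypothetical bracket inventory for a Dyck-2 language (an assumption, not from the source).
PAIRS = [("(", ")"), ("[", "]")]

def sample_dyck(max_depth, rng, p_open=0.5, max_len=40):
    """Sample a balanced bracket string whose nesting never exceeds max_depth."""
    out, stack = [], []  # stack holds the closers still owed
    # Opening a bracket costs two characters in total (opener now, closer later).
    while len(out) + len(stack) + 2 <= max_len:
        can_open = len(stack) < max_depth
        if stack and (not can_open or rng.random() >= p_open):
            out.append(stack.pop())          # close the innermost open bracket
        elif can_open:
            opener, closer = rng.choice(PAIRS)
            out.append(opener)
            stack.append(closer)
        if not stack and rng.random() < 0.2:
            break                            # stop at a balanced point
    out.extend(reversed(stack))              # close whatever is still open
    return "".join(out)

def nesting_depth(s):
    """Maximum bracket nesting depth of a balanced string."""
    depth = best = 0
    for ch in s:
        if ch in "([":
            depth += 1
            best = max(best, depth)
        else:
            depth -= 1
    return best

def is_balanced(s):
    """True if every closer matches the most recent unmatched opener."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
        else:
            stack.append(ch)
    return not stack

rng = random.Random(0)
samples = [sample_dyck(max_depth=3, rng=rng) for _ in range(5)]
```

Capping `max_depth` at sampling time is what would let a later evaluation hold out deeper strings than anything seen in training.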

The trouble starts when you ask whether the model learned the hierarchy or learned a shortcut that mimics it. Several notes converge on the same answer: models pass grammar tests by leaning on sentence length, word choice, and orthography rather than grammatical rules, and standard benchmarks can't tell the two apart unless the tests are specifically built to rule out surface heuristics (see: Can models pass tests while missing the actual grammar?). That's the failure-of-learnability case in action: the hierarchy is present in the language, but what gets absorbed is a correlate, not the constraint.
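One way to check whether a test set rules out a surface heuristic is to score the heuristic itself. The sketch below assumes a minimal-pair format of (grammatical, ungrammatical) sentences and measures how often "pick the longer sentence" would pick the grammatical one. The function name and the two toy pairs are invented for illustration and are not taken from BabyLM or any cited benchmark.

```python
def length_heuristic_accuracy(pairs):
    """Fraction of (good, bad) pairs where 'pick the longer sentence' chooses good.

    Ties count as half, since a length-only rule would have to guess.
    """
    score = 0.0
    for good, bad in pairs:
        g, b = len(good.split()), len(bad.split())
        score += 1.0 if g > b else 0.5 if g == b else 0.0
    return score / len(pairs)

# Toy pairs, made up for illustration only.
matched = [("the dogs that bark are loud", "the dogs that bark is loud")]
confounded = [("the dog barks loudly", "the dog bark")]

matched_score = length_heuristic_accuracy(matched)        # length carries no signal
confounded_score = length_heuristic_accuracy(confounded)  # length alone solves it
```

A score near 0.5 suggests length is controlled for; a score far from 0.5 suggests a model could pass the set without using grammatical structure at all. The same check could be repeated for word frequency or orthographic cues.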

And the corpus predicts exactly where that shortcut breaks: as structural complexity climbs. Grammatical competence degrades predictably with syntactic depth and embedding; simple sentences are handled, deep recursion fails consistently (see: Does LLM grammatical performance decline with structural complexity?). Even top-tier models like Llama3-70b systematically misidentify embedded clauses and complex nominals, and the degradation tracks depth so reliably it looks like a law (see: Why do large language models fail at complex linguistic tasks?). The breakdowns map to specific places (implicit relations, forward-planning discourse, attention layers), not just generic difficulty (see: Where exactly do language models fail at structural language tasks?). The hierarchy was learnable enough to look right and unlearnable enough to collapse under depth.
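The depth-dependent decline described above is usually seen by bucketing evaluation results by embedding depth. The following is a minimal sketch under the assumption that each test item already has a depth label and a correct/incorrect outcome; `accuracy_by_depth` is a hypothetical helper, and the records are toy values invented to show the shape of the output, not measurements from any paper.

```python
from collections import defaultdict

def accuracy_by_depth(records):
    """Map each depth to its accuracy, given (depth, correct) records."""
    totals = defaultdict(lambda: [0, 0])  # depth -> [hits, count]
    for depth, correct in records:
        totals[depth][0] += int(correct)
        totals[depth][1] += 1
    return {d: hits / n for d, (hits, n) in sorted(totals.items())}

# Toy records (depth, correct), made up for illustration only.
records = [(1, True), (1, True), (2, True), (2, False), (3, False)]
curve = accuracy_by_depth(records)
```

A curve that falls steadily as depth increases would match the pattern the notes describe; a flat curve would argue against it.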

The deeper diagnosis is that this isn't a complexity ceiling at all; it's a novelty ceiling. Reasoning and structural failures track instance-level unfamiliarity rather than task complexity: models fit instance-based patterns instead of generalizable algorithms, so any structure succeeds if something similar was in training and fails otherwise, regardless of how 'deep' it is (see: Do language models fail at reasoning due to complexity or novelty?). That reframes the whole question. A hierarchical formal language that fails learnability constraints doesn't produce a model that's a little worse at hierarchy; it produces one that's memorized a region of the structure and falls off a cliff at its edge.
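Telling a complexity ceiling apart from a novelty ceiling requires crossing the two factors rather than measuring depth alone. The sketch below assumes novelty can be approximated by exact membership in the training set, which is a simplification (the notes speak of "similar" instances, and a real study would need a similarity measure). The helper name and the toy records are hypothetical.

```python
def accuracy_by_depth_and_novelty(train_items, test_records):
    """Accuracy per (depth, seen_in_training) cell.

    test_records holds (item, depth, correct) triples.
    """
    seen = set(train_items)
    cells = {}
    for item, depth, correct in test_records:
        key = (depth, item in seen)
        hits, n = cells.get(key, (0, 0))
        cells[key] = (hits + int(correct), n + 1)
    return {key: hits / n for key, (hits, n) in cells.items()}

# Toy data, made up for illustration only.
train_items = ["(())", "()[]"]
test_records = [
    ("(())", 2, True),   # deep, seen
    ("([])", 2, False),  # deep, novel
    ("()[]", 1, True),   # shallow, seen
    ("[]()", 1, False),  # shallow, novel
]
table = accuracy_by_depth_and_novelty(train_items, test_records)
```

If accuracy differs mainly along the seen/novel axis at a fixed depth, that would support the novelty reading; if it differs mainly along depth within the novel items, that would support the complexity reading.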

The less obvious point is that this same shape, correct-looking surface behavior detached from the underlying competence, shows up far beyond grammar. Models can explain a concept correctly, fail to apply it, and even recognize their own failure, a 'Potemkin' pattern that signals functionally disconnected explanation and execution pathways (see: Can LLMs understand concepts they cannot apply?). So hierarchy-without-learnability isn't a niche linguistics problem; it's one instance of a general signature where statistical imitation passes for structural understanding right up until the structure has to do real work.


Sources (7 notes)

Can formal language pretraining make language models more efficient?

Pre-pretraining 1B models on hierarchical formal languages achieves equivalent loss and better syntactic generalization using 33% fewer natural language tokens. The mechanism persists: attention heads trained on formal languages remain critical for syntactic performance on natural language.

Can models pass tests while missing the actual grammar?

BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Do formal languages that satisfy hierarchy but fail learnability constraints reveal a fundamental gap between linguistic structure and model capacity, or a gap between training signals and model induction?** Still open.

What a curated library found, and when (dated claims, not current truth). Findings span 2023–2026.
• Pre-pretraining on hierarchical formal languages reaches equivalent loss with 33% fewer natural-language tokens, and attention patterns persist into downstream syntax tasks, suggesting hierarchy leaves a mechanistic trace (~2025, arXiv:2502.19249).
• Models pass grammar tests via surface heuristics (sentence length, word choice, orthography) rather than learned grammatical rules; standard benchmarks cannot distinguish imitation from competence unless specifically designed to rule out those heuristics (~2024).
• Grammatical competence degrades **predictably and reliably** with syntactic depth and embedding complexity; even Llama3-70B systematically fails on embedded clauses and deep recursion (~2025, arXiv:2503.19260).
• Reasoning and structural failures correlate with **instance-level unfamiliarity**, not task-level complexity; models memorize training-adjacent patterns and collapse outside that region (~2026, arXiv:2602.06176).
• "Potemkin understanding": models produce correct explanations while failing execution, signaling disconnected explanation and execution pathways (~2024).

Anchor papers (verify; mind their dates): arXiv:2502.19249 (2025); arXiv:2503.19260 (2025); arXiv:2602.06176 (2026); arXiv:2401.11817 (2024).

Your task:
(1) RE-TEST each constraint. For depth-dependent failure, check whether newer scaling, mixture-of-experts routing, sparse attention, or depth-specific training (e.g., curriculum on recursion depth) have since relaxed the edge-of-distribution collapse. For the instance-level diagnosis, probe whether recent meta-learning or in-context few-shot methods decouple memorization from generalization. Separate the durable question (can hierarchy be *learned* from finite data?) from the perishable claim (Llama3-scale models are fundamentally capped at shallow recursion).
(2) Surface the strongest **contradicting or SUPERSEDING work** from the last ~6 months, especially any that shows self-improving agents, test-time learning, or structured decoding methods that *compensate* for shallow structural induction.
(3) Propose 2 research questions assuming the regime **may have shifted**: (a) Does hierarchical bias *during pretraining* (not fine-tuning) enable depth-generalization beyond instance-matched training? (b) Can orchestration-layer constraints (beam search with parse-tree bounds, memory-augmented decoding, agent loops) *rescue* learned-but-dormant hierarchical knowledge?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
