Why do only context-sensitive formal languages transfer effectively to natural language?
This explores why pretraining on artificial formal languages only helps a model learn real language when those formal languages capture deep hierarchical structure — and what that tells us about how transformers actually learn grammar.
The corpus has a clean answer to the literal question, plus a set of surprising neighbors that reframe it. The direct finding is that transfer succeeds only when a formal language clears two bars at once: it has to encode nested, hierarchical dependencies (the kind of structure real grammar lives in), and it has to be something a transformer can actually learn and generalize across lengths (see "What formal languages actually help transformers learn natural language?"). Miss either bar, by being too flat to carry structure or too complex for the architecture to absorb, and the head start evaporates. A purely sequential or context-free toy language doesn't transfer because it isn't teaching the model the thing natural language needs.
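To make the "nested versus beyond context-free" distinction tangible, here is a minimal sketch using bracket languages. The notes do not name the formal languages involved, so the bracket alphabet and both checker functions are hypothetical stand-ins. The first checker accepts strictly nested strings (a Dyck language, which one stack can recognize); the second accepts strings where each bracket type is balanced on its own but types may interleave, a shuffle construction usually cited as going beyond context-free because its dependencies can cross.

```python
# Hypothetical bracket inventory; each opener/closer pair is one dependency type.
PAIRS = {"(": ")", "[": "]"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


def is_well_nested(s: str) -> bool:
    """Dyck check: each closer must match the most recent unmatched opener."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack


def is_shuffle_balanced(s: str) -> bool:
    """Shuffle check: each bracket type is balanced on its own,
    but different types may interleave, so dependencies can cross."""
    open_counts = {opener: 0 for opener in PAIRS}
    for ch in s:
        if ch in PAIRS:
            open_counts[ch] += 1
        elif ch in CLOSERS:
            opener = CLOSERS[ch]
            if open_counts[opener] == 0:
                return False  # closer with nothing of its type open
            open_counts[opener] -= 1
        else:
            return False  # symbol outside the alphabet
    return all(count == 0 for count in open_counts.values())
```

The string `([)]` separates the two: the nested checker rejects it, while the shuffle checker accepts it. A pretraining corpus sampled from either language would exercise different dependency patterns, which is the kind of difference the two-bar criterion is about.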
What makes this concrete is that the benefit is mechanistic, not vague. Pre-pretraining a 1B model on hierarchical formal languages reaches the same loss with 33% fewer real-language tokens, and the very attention heads shaped on the formal language stay load-bearing for syntax when the model moves to natural text (see "Can formal language pretraining make language models more efficient?"). So "transfer" isn't a metaphor: specific circuits learned on the artificial grammar are the ones doing the grammatical work later. That's why the match has to be structural: you're literally pre-wiring the parser.
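The 33% figure is easier to reason about as arithmetic. The sketch below only restates the quoted saving; the function names, the baseline budget, and the idea of subtracting a formal-language token budget are illustrative assumptions, not numbers from the cited note.

```python
def natural_tokens_needed(baseline_tokens: int, savings_pct: int = 33) -> int:
    """Natural-language tokens needed to match the baseline loss,
    given a percentage saving (33 is the figure quoted above)."""
    return baseline_tokens * (100 - savings_pct) // 100


def net_tokens_saved(baseline_tokens: int, formal_tokens: int,
                     savings_pct: int = 33) -> int:
    """Tokens saved overall once the formal-language phase is counted.
    A negative result means the formal phase cost more than it saved."""
    natural = natural_tokens_needed(baseline_tokens, savings_pct)
    return baseline_tokens - (natural + formal_tokens)
```

For a hypothetical baseline of 1,000,000,000 natural-language tokens, `natural_tokens_needed` gives 670,000,000. Whether the formal-language phase pays for itself then depends on how many formal tokens it consumes, which is what `net_tokens_saved` makes explicit.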
The lateral payoff is seeing this against what transformers fail at. Even top models systematically misread embedded clauses and complex nominals, and the errors get predictably worse as syntactic depth grows: statistical learning grabs surface patterns but not the deep recursive rules (see "Why do large language models fail at complex linguistic tasks?"). That failure is the mirror image of the transfer result: hierarchical pretraining helps precisely because depth is where ordinary training leaves a gap. The same theme shows up in the inverse direction too. Models translate natural language into logic with valid syntax but broken meaning, suggesting they read formal structure better than they can produce it (see "Can large language models translate natural language to logic faithfully?").
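"Syntactic depth" can be given a rough operational meaning. The sketch below assumes sentences are available as bracketed constituency parses (Penn Treebank style) and measures the deepest nesting level; the function name and the two example parses are made up for illustration, and the cited note may define depth differently.

```python
def nesting_depth(parse: str) -> int:
    """Return the deepest bracket nesting level in a bracketed parse string."""
    depth = deepest = 0
    for ch in parse:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parse: extra ')'")
    if depth != 0:
        raise ValueError("unbalanced parse: missing ')'")
    return deepest


# Hand-written example parses, not drawn from any dataset.
flat = "(S (NP the dog) (VP barked))"
embedded = "(S (NP the dog (SBAR that (S (NP the cat) (VP chased)))) (VP barked))"
```

Here `nesting_depth(flat)` is 2 and `nesting_depth(embedded)` is 5. Grouping a model's parsing errors by this number is one simple way to check whether accuracy falls as embedding gets deeper.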
There's a deeper lesson hiding here about how much structure to impose. More formalization is not always better: partial symbolic augmentation, which enriches natural language with selective logical scaffolding rather than fully converting it, beats both plain language and full formalization, because total conversion throws away semantic information while plain text lacks backbone (see "Why does partial formalization outperform full symbolic logic?"). Read alongside the transfer finding, a pattern emerges: the win is always at the join between structure and meaning, never at either pure extreme. Context-sensitive languages transfer because they sit at that join: structured enough to teach hierarchy, learnable enough to stick.
If you want to push further, the same "structure vs. statistics" tension surfaces in pragmatics, where models fail to flex inferences to communicative context the way humans do (see "Can language models adapt implicature to conversational context?"), and in reasoning, where models internally rank symbolic-computation tokens as most worth preserving, quietly privileging structure over grammar and filler (see "Which tokens in reasoning chains actually matter most?"). The thread connecting all of these: transformers reward the kind of structured signal that hierarchical formal languages are unusually good at delivering.
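The token-ranking result rests on what its source note calls greedy likelihood-preserving pruning. The note does not spell out the procedure, so the following is one plausible reading rather than the authors' method: repeatedly drop the token whose removal hurts a likelihood score the least. The `toy_score` function and its weights are invented purely so the example runs, and they are rigged so symbolic tokens score higher; a real scorer would query a language model.

```python
from typing import Callable, List, Sequence


def greedy_prune(
    tokens: Sequence[str],
    score: Callable[[List[str]], float],
    keep: int,
) -> List[str]:
    """Drop tokens one at a time, each time removing the token whose
    absence leaves `score` highest, until `keep` tokens remain."""
    kept = list(tokens)
    while len(kept) > keep:
        best_index = max(
            range(len(kept)),
            key=lambda i: score(kept[:i] + kept[i + 1:]),
        )
        del kept[best_index]
    return kept


# Hypothetical stand-in for a model's log-likelihood of the final answer.
# These weights are invented; they are NOT measured values.
TOY_WEIGHTS = {"3": 2.0, "*": 2.0, "4": 2.0, "=": 1.5, "12": 3.0}


def toy_score(kept: List[str]) -> float:
    """Sum invented per-token weights; unknown tokens count as filler."""
    return sum(TOY_WEIGHTS.get(tok, 0.1) for tok in kept)


chain = "so we know that 3 * 4 = 12 which is nice".split()
pruned = greedy_prune(chain, toy_score, keep=5)
```

With these toy weights the filler words go first and `pruned` ends up as the five symbolic tokens in their original order. That illustrates the mechanics of the procedure only; the finding itself depends on scores from an actual model.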
Sources (7 notes)
Transfer from formal to natural language succeeds only when formal languages satisfy two conditions: they capture hierarchical dependencies (Chomsky hierarchy) AND are learnable by transformers with length generalization (circuit complexity). Formal languages meeting both constraints outperform matched natural language training.
Pre-pretraining 1B models on hierarchical formal languages achieves equivalent loss and better syntactic generalization using 33% fewer natural language tokens. The mechanism persists: attention heads trained on formal languages remain critical for syntactic performance on natural language.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.