SYNTHESIS NOTE

Can formal language pretraining make language models more efficient?

Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This explores whether structural inductive biases in training data matter more than raw data volume.

Synthesis note · 2026-02-21 · sourced from Linguistics, NLP, NLU

Between Circuits and Chomsky (2025) tests whether training language models on formal languages before natural language can improve acquisition efficiency. The result is surprisingly strong:

For a 1B-parameter model trained on ~1.6B natural language tokens, pre-pretraining on formal languages with hierarchical dependencies achieves 33% greater token efficiency than matched natural language training.

The effect is mechanistically grounded: attention heads acquired during pre-pretraining on formal languages remain crucial for the model's performance on syntactic evaluations in natural language. Structure from formal language training transfers to natural language processing at the level of learned mechanisms.

Why hierarchical formal languages specifically? Papadimitriou & Jurafsky (2023) showed that within the Chomsky hierarchy, context-sensitive languages transfer best to natural language. Effective transfer requires formal languages that capture the hierarchical dependency structures present in natural language. Not all formal languages transfer; only those that share the structural properties that matter for syntax do.
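To make "hierarchical dependency structure" concrete, here is a minimal sketch of a Dyck-style bracket language, a textbook example of nested dependencies. It is an illustration only: the note does not say which formal languages the cited papers used, and the names `sample_dyck`, `is_well_nested`, and `max_depth` are hypothetical.

```python
import random

# Each opening bracket must later be closed by its own partner,
# which creates long-range, nested (hierarchical) dependencies.
BRACKETS = {"(": ")", "[": "]", "{": "}"}


def sample_dyck(n_pairs, rng, max_depth=8):
    """Sample a well-nested string containing exactly n_pairs bracket pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs and len(stack) < max_depth
        # Open a new bracket when nothing is pending, otherwise flip a coin.
        if can_open and (not stack or rng.random() < 0.5):
            bracket = rng.choice(sorted(BRACKETS))
            stack.append(bracket)
            out.append(bracket)
            opened += 1
        else:
            # Close the most recently opened bracket (last in, first out).
            out.append(BRACKETS[stack.pop()])
    return "".join(out)


def is_well_nested(s):
    """Return True if every bracket is closed in strictly nested order."""
    closers = {close: open_ for open_, close in BRACKETS.items()}
    stack = []
    for ch in s:
        if ch in BRACKETS:
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
        else:
            return False
    return not stack


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_dyck(6, rng))
```

A string such as `([)]` fails `is_well_nested` because its dependencies cross instead of nesting. Languages that permit such crossing dependencies are no longer context-free, which is one way to picture the distinction between levels of the Chomsky hierarchy mentioned above.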

This directly supports the related question "Can language models learn grammar from child-scale data?": if syntactic structure is efficiently acquirable from hierarchical formal languages (which encode the relevant inductive biases), then syntactic competence is trainable from far less data than previously thought, as long as the structure of the training data provides the right biases.

The broader implication: data volume matters less than structural inductive bias for syntactic generalization. LLMs trained on the right structures learn syntax efficiently; LLMs trained only on natural language may be learning syntax the hard way.


Original note title

pre-pretraining on hierarchical formal languages achieves 33% greater token efficiency than matched natural language training