What formal languages actually help transformers learn natural language?
Not all formal languages are equally useful for pre-pretraining. This note explores which formal languages transfer well to natural language and why, combining structural requirements with what transformers can actually learn.
Pre-pretraining on formal languages improves natural language acquisition, but the size of the benefit depends on which formal language is used. "Between Circuits and Chomsky" (2025) proposes a two-constraint model:
Constraint 1 (Chomsky hierarchy): The formal language must capture the hierarchical dependency structures present in natural language. Within the Chomsky hierarchy, context-sensitive languages transfer best to natural language (Papadimitriou & Jurafsky 2023). Simpler classes (regular, context-free) transfer less well because they capture less of the dependency structure that natural language syntax requires.
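To make "dependency structure" concrete, here is a minimal Python sketch using bracket strings as a toy stand-in for syntactic dependencies. It is an illustration only, not the setup from either cited paper: the function names `match_arcs` and `has_crossing` and the two-bracket alphabet are assumptions made for this example. Each bracket type is matched independently, so the same code handles strictly nested strings (the context-free pattern) and strings whose dependencies cross (a pattern beyond context-free nesting).

```python
# Toy alphabet (an assumption for illustration): two bracket types.
PAIRS = {"(": ")", "[": "]"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


def match_arcs(s):
    """Return sorted (open_index, close_index) arcs, matching each bracket type on its own."""
    stacks = {open_: [] for open_ in PAIRS}  # one stack of open positions per type
    arcs = []
    for j, ch in enumerate(s):
        if ch in PAIRS:
            stacks[ch].append(j)
        elif ch in CLOSERS:
            opener = CLOSERS[ch]
            if not stacks[opener]:
                raise ValueError(f"unmatched {ch!r} at position {j}")
            arcs.append((stacks[opener].pop(), j))
        else:
            raise ValueError(f"unexpected symbol {ch!r} at position {j}")
    if any(stacks.values()):
        raise ValueError("unclosed bracket")
    return sorted(arcs)


def has_crossing(arcs):
    """True if two arcs overlap without one containing the other."""
    return any(a < c < b < d for (a, b) in arcs for (c, d) in arcs)


if __name__ == "__main__":
    for s in ["([])", "([)]"]:
        arcs = match_arcs(s)
        print(s, arcs, "crossing" if has_crossing(arcs) else "nested")
```

In this toy, `([])` has only nested arcs, while `([)]` has two arcs that cross. A grammar that enforces a single shared stack accepts the first and rejects the second.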
Constraint 2 (circuit complexity): The formal language must be learnable by transformers with length generalization. Transformers cannot learn all context-sensitive languages, either in theory or in practice. Many formal languages within the Chomsky hierarchy are either impossible for transformers to learn or can only be learned without length generalization. Pre-pretraining on formal languages that fall outside transformer computational limits may fail to transfer even if those languages are structurally appropriate.
The optimal transfer zone is the intersection of these two constraints: formal languages expressive enough to capture hierarchical dependencies (Chomsky), and learnable by transformers with length generalization (circuit complexity). The paper formalizes this using C-RASP, a restricted programming language whose functions allow length generalization.
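As a rough illustration of the kind of computation the second constraint favors, the sketch below recognizes a bracket language using only prefix counts and comparisons, with no stack. This is plain Python written in the spirit of counting-based programs, not actual C-RASP syntax, and the names `prefix_counts` and `in_shuffle_dyck` are hypothetical. Whether any particular language is formally expressible in C-RASP should be checked against the paper; the sketch only shows that some languages with crossing dependencies can be checked by counting alone.

```python
from itertools import accumulate


def prefix_counts(s, symbol):
    """For each position i, count occurrences of `symbol` in s[: i + 1]."""
    return list(accumulate(1 if ch == symbol else 0 for ch in s))


def in_shuffle_dyck(s, pairs=(("(", ")"), ("[", "]"))):
    """Check each bracket type independently, using counts instead of a stack.

    A string is accepted when, for every bracket type, closes never outnumber
    opens in any prefix and the totals are equal at the end.
    """
    alphabet = {ch for pair in pairs for ch in pair}
    if any(ch not in alphabet for ch in s):
        return False
    for open_, close in pairs:
        opens = prefix_counts(s, open_)
        closes = prefix_counts(s, close)
        # Prefix condition: no close may appear before its open.
        if any(c > o for o, c in zip(opens, closes)):
            return False
        # Final condition: every open is eventually closed.
        if opens and opens[-1] != closes[-1]:
            return False
    return True


if __name__ == "__main__":
    for s in ["([])", "([)]", ")(", "(("]:
        print(repr(s), in_shuffle_dyck(s))
```

The design point is that the check for each bracket type depends only on running totals, so the same rule applies unchanged to strings of any length. Enforcing strict nesting across bracket types would instead require tracking the order of open brackets, which counts alone do not record.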
Empirical support: formal languages satisfying both constraints achieve equal or better transfer than matched natural language training. Formal languages satisfying only constraint 1 (hierarchical but not in C-RASP) show equivalent or slightly worse performance on some evaluations.
The broader principle: architectural computational limits are not just engineering constraints; they determine which inductive biases can actually be learned. The Chomsky hierarchy describes what structures are grammatically relevant; the circuit complexity hierarchy describes what structures are architecturally learnable. Effective pre-pretraining requires both.
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can explicit stack mechanisms extend what formal languages transformers can learn?
- What's the difference between formal and functional linguistic competence?
- Why do only context-sensitive formal languages transfer effectively to natural language?
- What formal language complexity level matches transformer computational limits best?
- What limits the effectiveness of formal language pretraining on transformer architectures?
Related concepts in this collection (5)
- Can formal language pretraining make language models more efficient?
  Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This explores whether structural inductive biases in training data matter more than raw data volume.
  the empirical finding this explains
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  parallel: architectural limits determine what capabilities can be learned, not just compute
- How do internal and external test-time scaling compare?
  Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
  architectural constraints recur across training and inference
- Can explicit stack tracking improve how transformers learn recursive syntax?
  Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.
  expands the architectural learnability boundary: the two-constraint model applies to standard transformers, but explicit stack tape extends transformer computational limits, potentially expanding the set of formal languages that produce positive transfer
- Do formal language prototypes improve reasoning across different domains?
  Can training language models on abstract reasoning patterns in Prolog and PDDL help them generalize to new reasoning tasks? This tests whether shared logical structures underlie seemingly different problem domains.
  ProtoReasoning confirms the two-constraint model from the reasoning side: Prolog and PDDL satisfy both hierarchical structure (Chomsky) and learnability (transformer limits), producing 4-6% cross-domain gains; the formal languages that work for reasoning transfer are the same ones this analysis predicts should work for language acquisition transfer
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
- Ask, and it shall be given: Turing completeness of prompting
- Large Linguistic Models: Investigating LLMs' metalinguistic abilities
- Large Language Model Programs
- Compositional Reasoning with Transformers, RNNs, and Chain of Thought
Original note title
effective formal language pre-pretraining requires matching formal language complexity to transformer computational limits