INQUIRING LINE

What makes recursive structure different from other forms of compositional generalization?

This explores why recursion — structure that nests inside itself, like a clause within a clause — seems to resist the scale-it-up cure that works for ordinary 'mix-and-match' compositional generalization.


This reads the question as asking what's special about *recursive* structure (a thing made of smaller copies of itself — a clause embedded in a clause, an operation applied to its own output) compared to the broader family of compositional generalization (recombining known parts in novel arrangements). The corpus draws a surprisingly clean line between the two, and it runs along the axis of *depth*.

For garden-variety compositional generalization, the encouraging news is that you may not need anything clever. Standard MLPs reach it through data and model scale alone, as long as the training distribution covers enough combinations of the underlying task modules [Can neural networks learn compositional skills without symbolic mechanisms?], and networks even tend to sort compositional subroutines spontaneously into isolated, reusable subnetworks [Do neural networks naturally learn modular compositional structure?]. The catch is that much of this 'generalization' turns out to be reuse of memorized computation: transformers often succeed by matching linearized subgraphs they saw in training, then fail drastically on genuinely novel compositions [Do transformers actually learn systematic compositional reasoning?]. Greff and colleagues frame the deeper obstacle as the *binding problem*: networks struggle to bind distributed pieces dynamically into fresh structures and to reuse that structure in new combinations [Why do neural networks fail at compositional generalization?].
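To make "covers enough combinations of the underlying modules" concrete, here is a minimal sketch of a compositional train/test split. The function names, the module labels, and the diagonal hold-out rule are illustrative assumptions, not the protocol of any cited study. The point is only that every individual module stays in training while some pairings are withheld.

```python
from itertools import product


def compositional_split(modules_a, modules_b, k=3):
    """Split all (a, b) pairings into train and held-out test sets.

    Hypothetical rule: hold out the pairings whose indices satisfy
    (i + j) % k == 0. With k >= 2 and at least two modules per side,
    every individual module still appears in some training pair, so the
    test set probes new *combinations* rather than new parts.
    """
    train, test = [], []
    for (i, a), (j, b) in product(enumerate(modules_a), enumerate(modules_b)):
        if (i + j) % k == 0:
            test.append((a, b))
        else:
            train.append((a, b))
    return train, test


def coverage(train, modules_a, modules_b):
    """Fraction of all possible pairings that appear in the training set."""
    return len(set(train)) / (len(modules_a) * len(modules_b))


# Illustrative module names only.
shapes = ["rotate", "flip", "scale"]
colors = ["red", "green", "blue"]
train_pairs, test_pairs = compositional_split(shapes, colors, k=3)
print(len(train_pairs), len(test_pairs))  # 6 3
print(round(coverage(train_pairs, shapes, colors), 3))  # 0.667
```

Raising `k` raises coverage, which is the knob the scaling result says matters for recombination.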

Recursion is where this story breaks in a specific, telling way. The single sharpest data point in the corpus: Pushdown Layers add explicit stack tracking to self-attention and achieve 3–5x more sample-efficient *syntactic* generalization while maintaining perplexity. The improvement is read as evidence that recursive structure specifically benefits from a built-in architectural bias even though general compositional generalization emerges from scale [Can explicit stack tracking improve how transformers learn recursive syntax?]. In other words: scale buys you recombination; it does not buy you nesting. The symptom recurs in the corpus's linguistic evaluations: LLMs degrade *predictably* as syntactic depth and embedding increase, handling simple sentences well while consistently failing on deeply embedded clauses [Does LLM grammatical performance decline with structural complexity?] and misidentifying embedded clauses, verb phrases, and complex nominals as structure stacks up [Why do large language models fail at complex linguistic tasks?].
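As a rough illustration of what explicit stack tracking supplies, the sketch below records nesting depth token by token for a bracket language. It is not the Pushdown Layers mechanism, which operates inside attention. The function name and the bracket vocabulary are assumptions chosen for the example.

```python
# Hypothetical bracket vocabulary: each opener maps to its required closer.
BRACKETS = {"(": ")", "[": "]"}
CLOSERS = set(BRACKETS.values())


def nesting_depths(tokens):
    """Return the stack depth after each token.

    Raises ValueError if a closer does not match the most recent opener
    or if any opener is left unclosed at the end.
    """
    stack = []   # closers we still expect, innermost last
    depths = []
    for position, tok in enumerate(tokens):
        if tok in BRACKETS:
            stack.append(BRACKETS[tok])
        elif tok in CLOSERS:
            if not stack or stack.pop() != tok:
                raise ValueError(f"unmatched {tok!r} at position {position}")
        depths.append(len(stack))
    if stack:
        raise ValueError("unclosed opener(s) at end of input")
    return depths


print(nesting_depths(list("([()])")))  # [1, 2, 3, 2, 1, 0]
```

The stack grows with the input's nesting depth, which is exactly the quantity a fixed-size representation has to approximate.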

The reason recursion is different has a name in complexity theory. A fixed-depth transformer sits under the AC0/TC0 ceiling: it performs a bounded number of sequential computation steps no matter how wide you make it. Recursion needs *re-application*, feeding an operation its own output a number of times that grows with the input's nesting depth rather than being fixed in advance. The fixes that work are the ones that restore depth rather than width. Looped, parameter-shared 'recurrent-depth' transformers achieve systematic generalization and can extrapolate to deeper inputs than they trained on, with the ability emerging through a sharp three-phase grokking process (memorization, then in-distribution generalization, then out-of-distribution generalization) [Can looped transformers generalize to unseen knowledge combinations?]. The Hierarchical Reasoning Model couples slow and fast recurrent timescales to escape that same AC0/TC0 ceiling, reaching near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely, with only 27M parameters and 1,000 samples [Can recurrent hierarchies achieve reasoning that transformers cannot?].
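A toy analogy for depth versus re-application, under loud assumptions: the string rewrite below stands in for one "layer", and nothing here is a transformer. Each step collapses every innermost parenthesised sum in parallel. A fixed number of steps stalls on inputs nested deeper than that number, while looping the same shared step finishes at any depth.

```python
import re

# Matches a parenthesised group of integers that contains no nested parentheses.
INNERMOST = re.compile(r"\((\d+(?: \d+)*)\)")


def one_step(expr):
    """One shared 'layer': replace every innermost (a b c) with its sum."""
    return INNERMOST.sub(
        lambda m: str(sum(int(t) for t in m.group(1).split())), expr
    )


def fixed_depth(expr, num_layers):
    """Apply the step a fixed number of times, like a fixed stack of layers."""
    for _ in range(num_layers):
        expr = one_step(expr)
    return expr


def looped(expr, max_iters=10_000):
    """Re-apply the same step until the expression stops changing."""
    for _ in range(max_iters):
        nxt = one_step(expr)
        if nxt == expr:
            break
        expr = nxt
    return expr


def nested_sum(depth):
    """Build '(1 (1 (... 1)))' with `depth` levels of nesting."""
    expr = "1"
    for _ in range(depth):
        expr = f"(1 {expr})"
    return expr


print(fixed_depth(nested_sum(3), 3))  # 4
print(fixed_depth(nested_sum(5), 3))  # (1 (1 4))  still unresolved
print(looped(nested_sum(5)))          # 6
```

Making `one_step` "wider" would not help the fixed-depth case; only more applications of it do.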

So the thing you might not have known you wanted to know: compositional generalization and recursion fail for *different reasons*, and they want *different cures*. Recombination is a coverage problem: show the model enough of the space and scaling data and model size closes the gap. Recursion is a *depth* problem, and you can't paper over it with more parameters or more data; you have to give the architecture a way to apply itself to its own output, whether that's an explicit stack [Can explicit stack tracking improve how transformers learn recursive syntax?] or genuine recurrence [Can looped transformers generalize to unseen knowledge combinations?] [Can recurrent hierarchies achieve reasoning that transformers cannot?].


Sources (9 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Can explicit stack tracking improve how transformers learn recursive syntax?

Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can looped transformers generalize to unseen knowledge combinations?

Recurrent-depth transformers with shared parameters across iterations enable systematic generalization and depth extrapolation that vanilla transformers cannot achieve. This emerges through a sharp three-phase process: memorization, in-distribution, then out-of-distribution generalization.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether the recursion–vs–composition boundary holds in current LLM practice. The question: *What architectural or training moves have shifted how language models handle recursive structure relative to mere compositional recombination?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026. A curated library identified these constraints:
• Plain scaling closes compositional-generalization gaps but NOT recursive-depth failures; transformers plateau on embedded clauses (2025–2026).
• Pushdown Layers (explicit stack) yield 3–5× more sample-efficient syntactic generalization than standard attention while maintaining perplexity; recursion benefits from architectural bias where composition emerges from scale alone (2023).
• Recurrent-depth transformers with parameter-shared looping achieve extrapolation to unseen depths via sharp grokking; Hierarchical Reasoning Model couples dual timescales to breach AC0/TC0 limits (2025–2026).
• LLM grammatical competence degrades predictably with syntactic depth; systematic blind spots worsen with structural complexity (2025).
• Recursion is a *depth* problem (needs re-application), not a coverage problem (2023–2026).

Anchor papers (verify; mind their dates):
• arXiv:2310.19089 (Pushdown Layers, 2023)
• arXiv:2604.07822 (Recurrent-Depth Transformers, 2026)
• arXiv:2506.21734 (Hierarchical Reasoning Model, 2025)
• arXiv:2503.19260 (Linguistic Blind Spots, 2025)

Your task:
(1) RE-TEST each constraint: For Pushdown Layers, recurrent-depth, and Hierarchical Reasoning, has wider adoption, inference tooling, or hybrid architectures since relaxed the need for explicit depth mechanisms? Do current frontier models (GPT-4-class, o1-class) show recursion failures that match 2025 baselines, or have post-training/reasoning-time interventions masked the depth problem? Where does the constraint still *provably* hold?
(2) Surface the strongest *contradicting* or *superseding* work (last 6 months): any paper showing plain scaling, novel attention patterns, or in-context reasoning techniques that close recursive-structure gaps *without* architectural redesign?
(3) Propose 2 questions that assume the regime may have moved: (a) If reasoning-time depth (chain-of-thought, tree search, learned unrolling) has become the new bottleneck, how do we measure it separately from weight-based architectural depth? (b) Do foundation models trained on code or formal proofs exhibit different recursion-handling signatures than language-trained ones?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
