INQUIRING LINE

What makes recurrent depth enable compositional generalization across tasks?

This explores whether adding computational depth through recurrence — reusing the same layers across multiple passes — is what lets a model build answers compositionally and carry that skill from one task to another, and what the corpus actually credits for that ability.


The corpus suggests the honest answer is that depth helps, but recurrence (looping computation through the same layers rather than stacking ever more of them) is doing something more specific than just "more depth." The cleanest case for it comes from the Hierarchical Reasoning Model (Can recurrent hierarchies achieve reasoning that transformers cannot?), which couples a slow planning loop with a fast computation loop and solves Sudoku and mazes on which chain-of-thought transformers fail completely, using only 27M parameters. The mechanism isn't size: fixed-depth transformers sit under a complexity ceiling (AC0/TC0), and recurrence lets the model spend a variable amount of computation re-applying the same learned operation until a multi-step problem is resolved. Compositional tasks are exactly the ones that need that variable re-application, because each step reuses an operation the model already knows.
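The variable re-application idea can be sketched with a toy that is not HRM and involves no learning: a hand-written local update (`relax_step`, a hypothetical name) stands in for the shared learned operation, and reachability in a small grid maze is computed by re-applying it. The maze layout and the pass budget of 6 are assumptions chosen for illustration. Capped at a fixed number of passes, the procedure cannot reach a goal that is 16 steps away; looped until nothing changes, it can.

```python
# Toy illustration only: a hand-written update stands in for a learned layer.
# All names (relax_step, solve, MAZE) are hypothetical.

def relax_step(reached, walls):
    """One pass of the shared operation: extend the reached set by one step."""
    rows, cols = len(walls), len(walls[0])
    new = set(reached)
    for r, c in reached:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not walls[nr][nc]:
                new.add((nr, nc))
    return new

def solve(walls, start, goal, max_passes=None):
    """Re-apply relax_step; max_passes=None means loop until nothing changes."""
    reached = {start}
    passes = 0
    while max_passes is None or passes < max_passes:
        nxt = relax_step(reached, walls)
        if nxt == reached:  # fixed point: further passes add nothing
            break
        reached = nxt
        passes += 1
    return goal in reached, passes

# A serpentine maze: 1 = wall, 0 = open. The only route to the goal is 16 steps.
MAZE = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
walls = [[bool(x) for x in row] for row in MAZE]
start, goal = (0, 0), (4, 4)

fixed_ok, fixed_passes = solve(walls, start, goal, max_passes=6)  # "fixed depth"
loop_ok, loop_passes = solve(walls, start, goal)                  # "recurrent depth"

print(fixed_ok, fixed_passes)  # False 6
print(loop_ok, loop_passes)    # True 16
```

The same operation is used in both runs; only the number of times it is allowed to repeat differs, which is the distinction the paragraph above draws between stacking a fixed number of layers and looping one.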

That reuse framing is the thread that ties the rest of the corpus together. Even in plain transformers, the ingredients of composition show up as reused machinery: models that generalize to longer inputs do it by reusing the same attention heads across related tasks, so a short task can lend its scaffolding to a longer one (Can length generalization transfer between different related tasks?). Networks also tend to carve compositional problems into isolated modular subnetworks: ablate one and only its sub-function breaks (Do neural networks naturally learn modular compositional structure?). Recurrent depth can be read as a way to call those modular subroutines repeatedly instead of having to lay down a fresh copy of each at every layer. The same logic helps explain why depth pays off disproportionately at small scale: deep-and-thin models beat balanced ones by composing abstract concepts through successive layers rather than spreading parameters across width (Does depth matter more than width for tiny language models?).
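What "ablate one and only its sub-function breaks" means operationally can be shown with a hand-wired toy. This is not a trained network and not the pruning method from the cited note; the unit assignments and the name `forward` are assumptions made up for illustration.

```python
# Hand-wired toy, not a trained network. Hidden units 0-1 serve a "sum"
# subroutine and units 2-3 serve a "diff" subroutine (hypothetical layout).

def forward(x, y, ablate=frozenset()):
    """Compute both sub-functions; units listed in `ablate` are zeroed out."""
    pre = [x, y, x, -y]
    hidden = [0.0 if i in ablate else v for i, v in enumerate(pre)]
    return {"sum": hidden[0] + hidden[1], "diff": hidden[2] + hidden[3]}

intact = forward(3.0, 2.0)
no_sum = forward(3.0, 2.0, ablate={0, 1})  # knock out the "sum" module only

print(intact)  # {'sum': 5.0, 'diff': 1.0}
print(no_sum)  # {'sum': 0.0, 'diff': 1.0}
```

In a real model the modules would have to be discovered, for example by pruning, rather than read off the wiring as they are here.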

But the corpus also pushes back hard against crediting architecture alone. One line of work shows that ordinary MLPs achieve compositional generalization through data and model scale with no architectural tricks at all, as long as training covers enough combinations of the task modules, and that success can be predicted by checking whether the constituents are linearly decodable from the hidden activations (Can neural networks learn compositional skills without symbolic mechanisms?). So what "recurrent depth" may really be buying is a cheaper, more sample-efficient route to the same separable internal representations that scale would otherwise brute-force into existence.
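A minimal sketch of a linear-decodability check, under loud assumptions: the two "hidden codes" below are invented for illustration rather than taken from any trained model, and a perceptron stands in for whatever probe the cited work actually uses. The probe asks whether one constituent, `a`, can be read out of the hidden vector by a single linear threshold.

```python
# Toy linear-probe check. Both hidden codes are made up; the perceptron is a
# stand-in probe, not the method of any cited paper.
from itertools import product

def linearly_decodable(points, labels, max_epochs=100):
    """Perceptron probe: True if it finds a linear readout with zero errors."""
    w = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for h, y in zip(points, labels):
            score = sum(wi * hi for wi, hi in zip(w, h)) + bias
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + y * hi for wi, hi in zip(w, h)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # no separating readout found within max_epochs

combos = list(product((0, 1), repeat=2))        # all (a, b) constituent pairs
labels_a = [1 if a else -1 for a, _ in combos]  # probe target: constituent a

mixed_code = [(a + b, a - b) for a, b in combos]  # a and b mixed, but additively
xor_code = [(a ^ b, b) for a, b in combos]        # a recoverable only via XOR

print(linearly_decodable(mixed_code, labels_a))  # True
print(linearly_decodable(xor_code, labels_a))    # False
```

Both codes contain enough information to recover `a`; only the additive one exposes it to a linear readout, which is the property the note says predicts compositional success.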

The darker reading is that much of what looks like composition isn't. Transformers often succeed by memorizing computation subgraphs from training and linearizing them into pattern-matches, then collapse on genuinely novel compositions, with errors compounding step by step (Do transformers actually learn systematic compositional reasoning?). The deeper diagnosis is the binding problem: networks struggle to dynamically bind distributed information into reusable structure, segregate entities, and recombine them in new ways (Why do neural networks fail at compositional generalization?). Read against this, recurrent depth's real contribution may be that iterative re-application gives the model more opportunities to bind and re-bind intermediate results across passes, partially routing around the very failure that limits fixed-depth networks.
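A back-of-the-envelope sketch of why step-by-step errors compound. It assumes every step succeeds independently with the same probability, which is a simplification introduced here and not a claim from the cited work; the 0.95 figure is likewise an arbitrary example value.

```python
# Assumption: each reasoning step succeeds independently with equal probability.

def chain_accuracy(per_step_accuracy, steps):
    """Probability that every one of `steps` sequential steps is correct."""
    return per_step_accuracy ** steps

for steps in (1, 5, 10, 20):
    print(steps, round(chain_accuracy(0.95, steps), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Under this simple model, a system that is right 95% of the time per step finishes a 20-step composition correctly only about a third of the time.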

So the takeaway you might not have gone looking for: the field has quietly stopped asking whether neural nets can compose at all, since modern systems clearly handle sophisticated syntax, logic, and code, and is now arguing about *how* they do it without explicit symbolic structure (Can neural networks actually achieve compositional generalization?). Recurrent depth isn't a magic ingredient; it's a bet that re-running learned operations under a planning loop reaches genuine multi-step composition with far less data and far fewer parameters than scaling width. On bounded, deeply compositional puzzles, that bet is currently winning.


Sources (8 notes)

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can length generalization transfer between different related tasks?

Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Can neural networks actually achieve compositional generalization?

DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating claims about recurrent depth and compositional generalization. The question remains open: what mechanism in recurrent computation—if anything distinctive—enables compositional generalization across tasks, and is it fundamentally different from scaling or architectural priors?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat each as perishable.

• Recurrent depth couples slow planning with fast computation loops, solving Sudoku/mazes unsolvable by fixed-depth transformers at 27M params—suggesting variable re-application of learned operations is the key, not model size (~2025).

• Compositional generalization also emerges in plain MLPs via data & model scale alone, contingent on training covering task-module combinations and linear decodability of constituents—implying recurrence may be a sample-efficient proxy for scale rather than a necessity (~2024).

• Transformers often succeed by memorizing and linearizing computation subgraphs, collapsing on genuinely novel compositions; the binding problem—failure to dynamically bind distributed information into reusable structure—is the deeper constraint (~2023–2024).

• Recurrent depth's gain is iterative re-binding of intermediate results across passes, partially routing around binding failures in fixed-depth networks (~2025).

• Modern systems have clearly succeeded at syntax, logic, and code composition; the field has shifted from "can nets compose?" to "how, and at what sample/compute cost?" (~2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2012.05208 (2020) — The Binding Problem in Artificial Neural Networks
- arXiv:2305.18654 (2023) — Faith and Fate: Limits of Transformers on Compositionality
- arXiv:2506.21734 (2025) — Hierarchical Reasoning Model
- arXiv:2507.07207 (2025) — Scaling can lead to compositional generalization

Your task:

(1) RE-TEST EACH CONSTRAINT. For the 27M-parameter Sudoku result, has post-2025 work shown that standard transformers + chain-of-thought or in-context prompting, or newer RL/verification pipelines, have closed that gap without recurrence? Does linear decodability of constituents still predict compositional success in modern dense/sparse models, or do emergent binding mechanisms (e.g., in vision–language or multi-modal settings) relax it? Does the binding problem persist as stated, or have architectural innovations (linear attention, structured state spaces, mixture-of-experts) partially dissolved it? Separate the durable question (how do nets re-use computation) from perishable claims (recurrence is necessary).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: has any recent paper shown that recurrent depth offers no sample-efficiency advantage over width + scale, or that the Sudoku/maze results are orthogonal to real-world compositional tasks (code, math, language)?

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If binding can be solved via post-hoc verification or RL rather than architecture, do we need recurrent depth at all for compositional generalization in deployment? (b) Do recurrent depth and retrieval-augmented generation or tool-use achieve compositional generalization via the same or different mechanisms?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
