What makes recurrent depth enable compositional generalization across tasks?
This explores whether adding computational depth through recurrence — reusing the same layers across multiple passes — is what lets a model build answers compositionally and carry that skill from one task to another, and what the corpus actually credits for that ability.
The corpus suggests the honest answer is: depth helps, but recurrence, looping computation through the same layers rather than stacking ever more of them, is doing something more specific than just "more depth." The cleanest case for it comes from the Hierarchical Reasoning Model (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"), which couples a slow planning loop with a fast computation loop and solves Sudoku and maze puzzles on which chain-of-thought transformers fail completely, with only 27M parameters. The mechanism isn't size; it's that fixed-depth transformers sit under a complexity ceiling (AC0/TC0), and recurrence lets the model spend variable amounts of computation re-applying the same learned operation until a multi-step problem is resolved. Compositional tasks are exactly the ones that need that variable re-application: each step reuses an operation the model already knows.
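The contrast between a fixed number of applications and variable re-application can be sketched with a toy. This is not HRM or any model from the corpus: the "learned operation" is a hand-written one-hop reachability update, and the names `step`, `fixed_depth`, and `recurrent_depth` are hypothetical. It only illustrates why a fixed budget of passes fails once a problem needs more steps than the budget allows, while looping the same operation to a fixed point does not.

```python
def step(reached, edges):
    """One application of the shared operation: extend reachability by one hop."""
    return reached | {dst for src, dst in edges if src in reached}


def fixed_depth(start, edges, depth):
    """Apply the same step a fixed number of times (stand-in for a fixed layer stack)."""
    reached = {start}
    for _ in range(depth):
        reached = step(reached, edges)
    return reached


def recurrent_depth(start, edges, max_passes=1000):
    """Re-apply the step until nothing changes, so compute scales with the input."""
    reached = {start}
    for passes in range(1, max_passes + 1):
        updated = step(reached, edges)
        if updated == reached:  # fixed point: another pass adds nothing
            return reached, passes
        reached = updated
    return reached, max_passes


# Toy problem (assumed for illustration): a chain 0 -> 1 -> ... -> 9.
chain = [(i, i + 1) for i in range(9)]
shallow = fixed_depth(0, chain, depth=4)
deep, passes = recurrent_depth(0, chain)

print(9 in shallow)  # the 4-pass budget stops short of node 9
print(9 in deep, passes)
```

The design point is that `recurrent_depth` has no more "parameters" than `fixed_depth`; it only decides at run time how many times to call the same `step`.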
That reuse framing is the thread that ties the rest of the corpus together. Even in plain transformers, the ingredients of composition show up as reused machinery: models that generalize to longer inputs do it by reusing the same attention heads across related tasks, so a short task can lend its scaffolding to a longer one (see "Can length generalization transfer between different related tasks?"). And networks tend to carve compositional problems into isolated modular subnetworks: ablate one and only its sub-function breaks (see "Do neural networks naturally learn modular compositional structure?"). Recurrent depth can be read as a way to call those modular subroutines repeatedly instead of having to lay down a fresh copy of each at every layer, which is why depth pays off disproportionately at small scale: deep-and-thin models beat balanced ones by composing abstract concepts through successive layers rather than spreading parameters across width (see "Does depth matter more than width for tiny language models?").
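The parameter arithmetic behind "fresh copy at every layer" versus "call the same block again" can be made concrete with a rough count. The sketch below assumes a generic transformer block of about 12·d² weights (4·d² for the attention projections plus 8·d² for a 4x-expansion MLP, ignoring embeddings, biases, and norms). That approximation and the widths and depths used are illustrative assumptions, not the configurations of any model cited here.

```python
def block_params(width):
    """Approximate weights in one transformer block (assumed 12 * d^2 rule of thumb)."""
    return 12 * width * width


def untied_stack_params(width, layers):
    """Every layer owns a fresh copy of the block."""
    return layers * block_params(width)


def tied_stack_params(width, passes):
    """One block re-applied `passes` times: the count does not grow with depth."""
    return block_params(width)


# Hypothetical shapes at the same parameter budget.
deep_thin = untied_stack_params(width=512, layers=32)
balanced = untied_stack_params(width=1024, layers=8)
looped = tied_stack_params(width=512, passes=32)

print(deep_thin, balanced, looped)
```

Under this approximation, halving the width buys four times as many layers at the same budget, and tying the weights buys the same effective depth for the cost of a single block.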
But the corpus also pushes back hard against crediting architecture alone. One line of work shows that ordinary MLPs achieve compositional generalization through data and model scale with no architectural tricks at all, as long as training covers enough combinations of the task modules, and you can predict success just by checking whether the constituents are linearly decodable from the hidden activations (see "Can neural networks learn compositional skills without symbolic mechanisms?"). So what "recurrent depth" may really be buying is a cheaper, more sample-efficient route to the same separable internal representations that scale would otherwise brute-force into existence.
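A linear-decodability check of the kind described above can be sketched as a linear probe. The example is a toy under stated assumptions: the "hidden activations" are synthetic, the probe is a plain perceptron rather than whatever the cited work uses, and `train_probe` is a hypothetical helper. It contrasts a representation that mixes two constituents linearly with one that stores only their XOR.

```python
def train_probe(activations, labels, epochs=100):
    """Fit a perceptron (a linear probe) and return its accuracy on the same data."""
    dim = len(activations[0])
    weights = [0.0] * dim
    bias = 0.0

    def predict(x):
        score = sum(w * v for w, v in zip(weights, x)) + bias
        return 1 if score > 0 else 0

    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(activations, labels):
            error = y - predict(x)
            if error != 0:  # standard perceptron update on a misclassified point
                for i in range(dim):
                    weights[i] += error * x[i]
                bias += error
                mistakes += 1
        if mistakes == 0:  # converged: the constituent is linearly decodable
            break

    correct = sum(predict(x) == y for x, y in zip(activations, labels))
    return correct / len(labels)


# Two binary constituents (a, b); the probe tries to read out `a`.
combos = [(a, b) for a in (0, 1) for b in (0, 1)]
labels_a = [a for a, _ in combos]

# Synthetic activations (assumptions, not real model states).
separable = [(a + 0.5 * b, b - 0.5 * a) for a, b in combos]  # linear mix of a and b
entangled = [(float(a ^ b),) for a, b in combos]             # only a XOR b survives

print(train_probe(separable, labels_a))
print(train_probe(entangled, labels_a))
```

The probe reads `a` perfectly from the linear mix and stays at chance on the XOR code, which is the sense in which decodability from the activations, rather than the architecture itself, predicts whether composition can succeed.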
The darker reading is that much of what looks like composition isn't. Transformers often succeed by memorizing computation subgraphs from training and linearizing them into pattern-matches, then collapse on genuinely novel compositions with errors compounding step by step (see "Do transformers actually learn systematic compositional reasoning?"). The deeper diagnosis is the binding problem: networks struggle to dynamically bind distributed information into reusable structure, segregate entities, and recombine them in new ways (see "Why do neural networks fail at compositional generalization?"). Read against this, recurrent depth's real contribution is that iterative re-application gives the model more opportunities to bind and re-bind intermediate results across passes, partially routing around the very failure that limits fixed-depth networks.
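The "errors compounding step by step" claim has a simple back-of-the-envelope form. Assuming, purely for illustration, that each reasoning step is correct independently with the same probability, the chance that a whole chain is right is that probability raised to the number of steps. The independence assumption and the 0.95 figure are hypothetical, not measurements from the cited work.

```python
def chain_success(per_step_accuracy, steps):
    """Probability every step is correct, assuming independent, identical steps."""
    return per_step_accuracy ** steps


# Illustrative numbers only: a step that is right 95% of the time.
for steps in (1, 5, 20):
    print(steps, round(chain_success(0.95, steps), 3))
```

Under these assumptions a 95%-accurate step leaves only about a 36% chance of getting a 20-step composition right, which is why small per-step errors matter so much on deep compositional problems.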
So the takeaway you might not have gone looking for: the field has quietly stopped asking whether neural nets can compose at all (modern systems clearly handle sophisticated syntax, logic, and code) and is now arguing about *how* they do it without explicit symbolic structure (see "Can neural networks actually achieve compositional generalization?"). Recurrent depth isn't a magic ingredient; it's a bet that re-running learned operations under a planning loop reaches genuine multi-step composition with far less data and far fewer parameters than scaling width. On bounded, deeply compositional puzzles, that bet is currently winning.
Sources (8 notes)
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.
DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.