INQUIRING LINE


What tasks does recurrent depth solve that feedforward models cannot?


This explores what recurrent depth (re-running the same layers in a loop rather than stacking more of them) buys you that a fixed feedforward pass cannot. The corpus points to one core answer: tasks that need *iterated state tracking and compositional reasoning*, where the right number of computation steps depends on the problem, not on the architecture. Looped models earn their gains on exactly these tasks by reapplying layers until a stable answer emerges, with convergence itself acting as a natural "I'm done" signal (see "Can models learn by looping instead of growing larger?"). Mechanistically, each recurrent layer converges to its own fixed point, and together these form stable cyclic trajectories. The model learns to *re-enact and repeat* feedforward inference stages it would otherwise have to spend fresh parameters on, and this behavior emerges without being explicitly trained for it (see "How do looped language models actually improve reasoning in depth?").
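The convergence-as-halting idea can be sketched in a few lines. This is a toy illustration, not the method of any cited model: `shared_block`, `run_looped`, the tolerance, and the cosine update are all assumptions chosen so that the loop is guaranteed to settle.

```python
import math

def shared_block(state, x):
    """One pass through the shared layers (hypothetical toy update).

    The map s -> 0.5 * cos(s) + x is a contraction, so repeated
    application converges to a fixed point for any input x.
    """
    return [0.5 * math.cos(s) + xi for s, xi in zip(state, x)]

def run_looped(x, max_loops=100, tol=1e-6):
    """Re-apply the same block until the state stops changing.

    Returns the final state and the number of loops actually used.
    """
    state = [0.0] * len(x)
    for step in range(1, max_loops + 1):
        new_state = shared_block(state, x)
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:  # convergence acts as the halting signal
            return state, step
    return state, max_loops

state, steps_used = run_looped([0.1, 1.0])
print(steps_used, state)
```

The number of loops is decided by the input at run time rather than fixed by the architecture, which is the property the paragraph above attributes to looped models.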

The sharpest evidence comes from puzzles that fixed-depth transformers struggle to crack. The Hierarchical Reasoning Model (HRM) couples slow abstract planning with fast detailed computation across two timescales and reaches near-perfect performance on Sudoku and mazes, tasks where chain-of-thought collapses completely, with just 27M parameters and 1,000 training samples (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). The proposed reason is a complexity ceiling: a transformer with a fixed number of layers is confined to a bounded circuit class (AC0/TC0), so problems requiring many sequentially dependent steps are, under standard complexity assumptions, out of reach no matter how wide you make it. Recurrence escapes that ceiling by letting depth grow at inference time.
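To make "many sequentially dependent steps" concrete, here is a minimal sketch of a state-tracking task: following an arrangement of five items through a sequence of permutations. The task and the names `compose` and `track_state` are illustrative assumptions, not taken from the cited notes; the point is only that each step depends on the result of the previous one, so the work grows with the length of the input.

```python
def compose(state, move):
    """Apply one permutation `move` to the current arrangement `state`."""
    return tuple(state[i] for i in move)

def track_state(moves, n_items=5):
    """Track the arrangement through every move, one dependent step at a time."""
    state = tuple(range(n_items))
    for move in moves:  # one loop iteration per move, reusing the same update
        state = compose(state, move)
    return state

swap = (1, 0, 2, 3, 4)    # exchange the first two items
rotate = (1, 2, 3, 4, 0)  # shift every item one place

print(track_state([swap, rotate]))
print(track_state([rotate, swap]))
```

Because the order of moves changes the result, the steps cannot simply be reordered or merged. A looped update handles a sequence of any length with the same function, whereas a fixed stack offers only a constant number of sequential stages.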

What's striking is that this isn't just "add more layers." The corpus repeatedly shows that *reusing* computation beats *adding* it. In masked diffusion language models, selectively looping early-middle layers matches same-size baselines with 3.3× fewer training FLOPs and beats genuinely deeper non-looped models on reasoning (see "Can looping layers beat adding depth in diffusion models?"). So the win comes specifically from iteration on a shared parameter set, not from raw parameter count. A related result holds in plain feedforward nets at small scale: deep-and-thin designs outperform balanced ones at 125M–350M parameters by composing abstract concepts through layers rather than spreading parameters across width (see "Does depth matter more than width for tiny language models?").
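The difference between reusing and adding computation comes down to simple bookkeeping, sketched below. The class name `DepthBudget` and all the numbers are hypothetical and are not the configurations from the cited work; they only show that looping raises effective depth without raising the parameter count.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DepthBudget:
    """Parameter and depth accounting for a layer stack (illustrative only)."""
    params_per_layer: int
    unique_layers: int
    loops: int = 1  # how many times the unique layers are re-applied

    @property
    def parameters(self) -> int:
        # Looping reuses weights, so it adds no parameters.
        return self.params_per_layer * self.unique_layers

    @property
    def effective_depth(self) -> int:
        # Sequential layer applications seen by one forward pass.
        return self.unique_layers * self.loops

stacked = DepthBudget(params_per_layer=1_000_000, unique_layers=24)
looped = DepthBudget(params_per_layer=1_000_000, unique_layers=6, loops=4)

print(stacked.parameters, stacked.effective_depth)
print(looped.parameters, looped.effective_depth)
```

In this made-up comparison both models apply 24 layers in sequence, but the looped one stores a quarter of the weights.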

There's a deeper hint about *why* iterated depth helps with compositional tasks: networks naturally decompose composable problems into modular subnetworks, each implementing an isolated subroutine (see "Do neural networks naturally learn modular compositional structure?"). Looping lets a model invoke and re-invoke those subroutines as many times as a problem demands, which is closer to running a program than evaluating a fixed function. And crossing critical depth thresholds doesn't just improve things gradually. In self-supervised RL it produces qualitative behavioral jumps (walking appears at depth 16, wall-climbing at depth 256), suggesting some capabilities only switch on once enough sequential computation is available (see "Does network depth unlock qualitatively new behaviors in RL?").

The honest caveat the corpus also offers: recurrent depth is not a universal win. For tasks driven by structural relationships rather than multi-step computation (collaborative filtering, for instance), a single constrained linear layer outperforms most deep models, because the right inductive bias matters more than computational depth (see "Can a linear model beat deep collaborative filtering?"). Recurrence pays off precisely when the task is *algorithmic*: state tracking, planning, length generalization, and other problems where the answer requires running a variable number of dependent steps. It does not pay off when the bottleneck is capacity or representation.
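For contrast, a single constrained linear layer of the kind described above needs no iteration at all. The sketch below assumes the model is a ridge-regularized item-to-item weight matrix with a zero diagonal, fitted with the closed form published for shallow linear autoencoders of this type; whether the cited note uses exactly this estimator is an assumption, and the function name, the `l2` value, and the interaction matrix are made up for illustration.

```python
import numpy as np

def fit_zero_diagonal_linear(X, l2=1.0):
    """Fit item-item weights B so that X @ B approximates X with diag(B) = 0.

    Closed form (assumed here): with P = inv(X.T @ X + l2 * I),
    B[i, j] = -P[i, j] / P[j, j] for i != j, and B[j, j] = 0.
    """
    n_items = X.shape[1]
    P = np.linalg.inv(X.T @ X + l2 * np.eye(n_items))
    B = -P / np.diag(P)       # broadcasting divides column j by P[j, j]
    np.fill_diagonal(B, 0.0)  # an item may not predict itself
    return B

# Hypothetical user-by-item interactions (rows: users, columns: items).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

B = fit_zero_diagonal_linear(X)
scores = X @ B  # predicted affinity of each user for each item
print(np.round(B, 3))
```

The whole model is one matrix solved in one step, so there is no depth to scale, which matches the note's claim that structural bias rather than sequential computation does the work in this setting.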

Sources 8 notes

Can models learn by looping instead of growing larger?

Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.

How do looped language models actually improve reasoning in depth?

Each recurrent layer converges to distinct fixed points forming stable cyclic trajectories. Looped models learn to mirror and repeat feedforward inference stages rather than discover new computation, and this behavior emerges naturally without explicit training.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can looping layers beat adding depth in diffusion models?

LoopMDM matches same-size masked diffusion models with 3.3× fewer training FLOPs and exceeds deeper non-looped baselines on reasoning tasks. Reusing computation through selective early-middle layer loops proves more effective than adding depth at fixed parameter budgets.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Can a linear model beat deep collaborative filtering?

ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.
