INQUIRING LINE

Training, RL, and Test-Time Scaling · Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · cross-cluster

How does hierarchical recurrence compare to selective layer looping for computational depth?

This explores two ways to get more 'thinking depth' out of a small network without adding parameters — stacking two recurrent timescales (hierarchical recurrence) versus re-running a chosen subset of layers in a loop (selective layer looping) — and what the corpus says about how they differ.

Both strategies buy computational depth on the cheap. Hierarchical recurrence couples a slow planning loop with a fast detail loop; selective layer looping simply re-applies a chosen band of layers more times. Each rejects the 'just add parameters' reflex, but they bet on different mechanisms, and the corpus increasingly suggests the simpler one is doing more of the work.

Hierarchical recurrence is the more elaborate design. The Hierarchical Reasoning Model (HRM) runs two coupled timescales, slow abstract planning over fast detailed computation, to break past the fixed-depth ceiling that limits ordinary transformers, reaching near-perfect Sudoku and maze solving with only 27M parameters (Can recurrent hierarchies achieve reasoning that transformers cannot?). The promise is that the two-level structure is what unlocks the reasoning. But a striking follow-up undercuts that: a 7M-parameter two-layer network that just recurses on its own latent state matched or beat those results, and the authors attribute the gain to recursion itself, not to the hierarchy (Can tiny recursive networks outperform massive language models?). In other words, the depth gained from iteration is what mattered; the two-timescale scaffolding was closer to optional.
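The structural difference between the two designs can be sketched in a few lines. This is a toy illustration only, not the HRM or the 7M-parameter model: the update functions `fast_step`, `slow_step`, their scalar weights, and the loop counts `n_slow` and `n_fast` are all hypothetical stand-ins chosen so the example runs without any dependencies.

```python
import math


def fast_step(z_low, z_high, x):
    """Hypothetical fast update: refine the detail state given the plan and the input."""
    return math.tanh(0.5 * z_low + 0.3 * z_high + 0.2 * x)


def slow_step(z_high, z_low):
    """Hypothetical slow update: revise the plan from the latest detail state."""
    return math.tanh(0.6 * z_high + 0.4 * z_low)


def hierarchical_recurrence(x, n_slow=4, n_fast=3):
    """Two coupled timescales: n_fast fast updates for every slow update."""
    z_high, z_low = 0.0, 0.0
    calls = 0  # number of update-function applications (serial steps)
    for _ in range(n_slow):
        for _ in range(n_fast):
            z_low = fast_step(z_low, z_high, x)
            calls += 1
        z_high = slow_step(z_high, z_low)
        calls += 1
    return z_high, calls


def plain_recursion(x, n_steps=16):
    """Single-state recursion: one update re-applied to its own latent state."""
    z = 0.0
    for _ in range(n_steps):
        z = math.tanh(0.5 * z + 0.5 * x)
    return z, n_steps


if __name__ == "__main__":
    _, hier_calls = hierarchical_recurrence(1.0)
    _, flat_calls = plain_recursion(1.0)
    print(hier_calls, flat_calls)  # both variants spend the same serial step budget
```

With the defaults, both variants spend 16 serial update steps; the only difference is whether those steps are organized into two coupled states or one. That is the distinction the follow-up result puts under pressure.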

Selective layer looping makes that minimalism the whole point. Instead of looping the entire stack, LoopMDM reuses just the early-middle layers and matches same-size diffusion models with 3.3× fewer training FLOPs while beating deeper non-looped baselines (Can looping layers beat adding depth in diffusion models?). This sits inside a broader finding that looped architectures generally gain their effective depth through iteration rather than scale: re-applying layers enables the state tracking and compositional generalization that parameter growth alone misses (Can models learn by looping instead of growing larger?), and the looped layers tend to settle into stable cyclic fixed points that re-enact feedforward inference stages rather than inventing wholly new computation (How do looped language models actually improve reasoning in depth?). The same idea makes world models up to 100× more parameter-efficient by spending extra loop iterations on the harder prediction steps (Can looped computation replace parameter count in world models?).
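A minimal sketch of the looping idea, under stated assumptions: the layers here are toy scalar functions built by a hypothetical `make_layer` helper, and `forward_with_loop`, `loop_start`, `loop_end`, and `n_loops` are illustrative names, not LoopMDM's actual interface. The point is only that re-running a band of layers raises the number of layer applications (effective depth) while the number of distinct layers (parameters) stays fixed.

```python
def make_layer(scale):
    """Hypothetical layer: a simple affine map on a scalar activation."""
    def layer(h):
        return scale * h + 1.0
    return layer


def forward_with_loop(layers, h, loop_start, loop_end, n_loops):
    """Run the stack once, but re-apply layers[loop_start:loop_end] n_loops times.

    Returns the final activation and the number of layer applications.
    """
    applied = 0
    # Layers before the looped band run once.
    for layer in layers[:loop_start]:
        h = layer(h)
        applied += 1
    # The chosen band is re-applied n_loops times with shared weights.
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            h = layer(h)
            applied += 1
    # Layers after the looped band run once.
    for layer in layers[loop_end:]:
        h = layer(h)
        applied += 1
    return h, applied


if __name__ == "__main__":
    layers = [make_layer(0.5) for _ in range(6)]  # 6 distinct layers
    _, depth_plain = forward_with_loop(layers, 1.0, 1, 3, n_loops=1)
    _, depth_looped = forward_with_loop(layers, 1.0, 1, 3, n_loops=3)
    print(depth_plain, depth_looped)  # effective depth without and with looping
```

In this toy setup, looping a two-layer band three times turns a 6-layer stack into 10 layer applications with no new layers. Which band to loop, and how many times, is exactly the design choice the selective approach exposes.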

So the comparison resolves toward a deflationary answer: hierarchical recurrence and selective looping are both ways of converting iteration into effective depth, and the evidence says iteration is the load-bearing ingredient. The hierarchy is one particular, and possibly unnecessary, way of organizing it. This matters because depth is valuable but costly. Depth genuinely beats width for tiny models by composing concepts through layers (Does depth matter more than width for tiny language models?), and in RL more depth can unlock qualitatively new behaviors at critical thresholds (Does network depth unlock qualitatively new behaviors in RL?), but depth is serial and slow.

That serial cost is where the most interesting tension lives, and it suggests neither approach is the final word. Pure depth-only reasoning can 'underthink,' so allocating compute to breadth, whether diverse abstractions or parallel latent trajectories, sometimes beats deepening a single path (Can abstractions guide exploration better than depth alone?; Can reasoning systems scale faster by exploring parallel paths instead?). And the gains from these recursive schemes aren't automatic: with stochastic recursive models, randomness alone does nothing; it is the principled variational training around the recursion that delivers the improvement (Does adding randomness alone improve recursive reasoning models?). The thing you didn't know you wanted to know: the field's headline 'hierarchical reasoning' result may be hierarchy-optional, and the live frontier is no longer hierarchy-vs-loop but depth-vs-width, that is, how much to iterate one path versus explore many.
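The depth-versus-width trade-off can be made concrete with simple budget arithmetic. The sketch below is an illustration under a strong assumption, namely that parallel paths run fully concurrently so that latency tracks the serial depth of one path; `split_budget` and its field names are hypothetical and do not come from any of the cited systems.

```python
def split_budget(total_steps, n_paths):
    """Split a fixed budget of block applications evenly across parallel paths.

    Assumes paths execute concurrently, so wall-clock latency is
    proportional to the serial depth of a single path.
    """
    if n_paths < 1 or total_steps < n_paths:
        raise ValueError("need at least one step per path")
    serial_depth = total_steps // n_paths
    return {
        "paths": n_paths,
        "serial_depth": serial_depth,
        "total_used": serial_depth * n_paths,
    }


if __name__ == "__main__":
    for k in (1, 4, 8):
        print(split_budget(64, k))
```

At a fixed budget of 64 block applications, one path means 64 serial steps, while eight paths mean 8 serial steps each. Whether the shallower, wider allocation actually solves the task is the empirical question the notes above leave open.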

Sources 11 notes

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can tiny recursive networks outperform massive language models?

A 7M-parameter two-layer network recursing on its latent reasoning state reached 45% on ARC-AGI-1, beating larger LLMs with 0.01% of their parameters. The gains come from recursion itself, not scale or hierarchical architecture.

Can looping layers beat adding depth in diffusion models?

LoopMDM matches same-size masked diffusion models with 3.3× fewer training FLOPs and exceeds deeper non-looped baselines on reasoning tasks. Reusing computation through selective early-middle layer loops proves more effective than adding depth at fixed parameter budgets.

Can models learn by looping instead of growing larger?

Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.

How do looped language models actually improve reasoning in depth?

Each recurrent layer converges to distinct fixed points forming stable cyclic trajectories. Looped models learn to mirror and repeat feedforward inference stages rather than discover new computation, and this behavior emerges naturally without explicit training.

Can looped computation replace parameter count in world models?

LoopWM achieves up to 100x parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can reasoning systems scale faster by exploring parallel paths instead?

GRAM demonstrates that recursive reasoning models should maintain and explore multiple latent trajectories in parallel, not only deepen single paths. Width-scaling avoids the serial latency penalty of depth while sampling the solution distribution more effectively on ambiguous problems.

Does adding randomness alone improve recursive reasoning models?

GRAM's ablations show naive stochasticity added to existing models yields no improvement. Gains come specifically from amortized variational inference, which couples stochastic latents to a principled generative objective rather than injecting undirected noise.
