What makes recursive depth more effective than parametric depth for puzzles?
This explores why looping a small network back over its own intermediate state ('recursive depth') beats simply stacking more layers or piling on parameters ('parametric depth') on structured puzzles like ARC-AGI.
The corpus's sharpest data point is a 7-million-parameter, two-layer network that recurses on its own latent reasoning state and reaches 45% on ARC-AGI-1, outscoring far larger LLMs with roughly 0.01% of their parameters; the gain is traced to the recursion itself rather than to scale or a hierarchical architecture (Can tiny recursive networks outperform massive language models?). This isn't a fluke: re-applying the same layers in depth gives a model something parameter scaling alone can't buy, namely state tracking and compositional generalization, with the network's own convergence acting as a natural 'I'm done' signal (Can models learn by looping instead of growing larger?).
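The idea of convergence as a halting signal can be illustrated with a toy sketch. This is not the architecture from the cited notes: the update rule `h' = tanh(W h + U x)`, the hand-picked weights `W` and `U`, the tolerance `tol`, and the function names are all assumptions chosen so that the loop is guaranteed to settle. It only shows the pattern of re-applying one set of weights until the latent state stops changing, so that the number of steps adapts to the input.

```python
import math

def step(h, x, W, U):
    """One pass of a single weight-tied layer: h' = tanh(W h + U x)."""
    n = len(h)
    return [
        math.tanh(
            sum(W[i][j] * h[j] for j in range(n))
            + sum(U[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(n)
    ]

def recurse_until_converged(x, W, U, tol=1e-6, max_steps=100):
    """Re-apply the same layer until the latent state stops moving.

    Returns the final latent state and the number of passes used.
    """
    h = [0.0] * len(W)
    for t in range(1, max_steps + 1):
        h_next = step(h, x, W, U)
        # Largest per-coordinate change acts as the halting signal.
        delta = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if delta < tol:
            return h, t
    return h, max_steps

# Hypothetical weights: row sums of |W| are below 1, so the update is a
# contraction and the loop is guaranteed to reach a fixed point.
W = [[0.2, -0.1], [0.1, 0.3]]
U = [[0.5, 0.0], [0.0, 0.5]]

h_easy, steps_easy = recurse_until_converged([0.0, 0.0], W, U)
h_hard, steps_hard = recurse_until_converged([1.0, -1.0], W, U)
print(steps_easy, steps_hard)
```

In this toy, the all-zero input is already at a fixed point and halts after one pass, while the non-zero input needs several passes. A trained recursive model would learn its weights and may use a different halting rule; the sketch only makes the "loop until settled" control flow concrete.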
The mechanism becomes concrete when you watch looped transformers run. Each recurrent layer converges to distinct fixed points that together form stable cyclic trajectories: the model re-enacts and repeats stages of feedforward inference rather than inventing new computation, and this behavior emerges on its own without being explicitly trained for it (How do looped language models actually improve reasoning in depth?). In other words, recursion lets a small set of weights be spent many times on a problem, adapting compute to the instance, instead of freezing one fixed pass through a huge stack. The efficiency payoff shows up directly under matched budgets: in masked diffusion models, selectively looping early-to-middle layers matches same-size baselines with 3.3× fewer training FLOPs and beats deeper non-looped models on reasoning (Can looping layers beat adding depth in diffusion models?).
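The phrase "a small set of weights spent many times" can be made concrete with a rough parameter count. The sketch below assumes a standard transformer layer with four `d_model × d_model` attention projections and a two-matrix MLP, and ignores biases, norms, and embeddings; the sizes (`d_model=512`, `d_ff=2048`, 2 layers looped 8 times) are hypothetical and not taken from any of the cited notes. It does not reproduce the 3.3× FLOPs figure, which concerns training cost rather than parameter count.

```python
def layer_params(d_model, d_ff):
    """Approximate weights in one transformer layer (biases and norms ignored)."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff            # up- and down-projection
    return attention + mlp

def stacked_model(depth, d_model, d_ff):
    """Parametric depth: every layer has its own weights."""
    return {
        "params": depth * layer_params(d_model, d_ff),
        "effective_depth": depth,
    }

def looped_model(unique_layers, loops, d_model, d_ff):
    """Recursive depth: a few layers are re-applied `loops` times."""
    return {
        "params": unique_layers * layer_params(d_model, d_ff),
        "effective_depth": unique_layers * loops,
    }

stacked = stacked_model(depth=16, d_model=512, d_ff=2048)
looped = looped_model(unique_layers=2, loops=8, d_model=512, d_ff=2048)
ratio = stacked["params"] / looped["params"]
print(stacked, looped, ratio)
```

Under these assumptions both models apply 16 layer passes per forward computation, but the looped one stores one eighth of the weights. Per-pass compute is roughly the same at equal effective depth, so the saving here is in parameters, not in inference FLOPs.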
This isn't a blanket verdict that depth is useless, and the corpus says so directly. In self-supervised RL, scaling to 1000 layers produces genuine qualitative jumps at critical thresholds (depth 16 unlocks walking, depth 256 unlocks wall-climbing), so raw depth can buy new behaviors when expressivity and exploration are the bottleneck (Does network depth unlock qualitatively new behaviors in RL?). The puzzle-solving case is different because the bottleneck is iterative refinement of a solution state, which is exactly what a loop supplies cheaply and a static parameter stack supplies only by brute force.
Pure recursive depth has its own ceiling, and the fix is to grow sideways, not just downward. Recursive reasoning models do better when they sample multiple latent trajectories in parallel, scaling in *width*, which avoids the serial latency penalty of ever-deeper single chains and samples the solution distribution more effectively on ambiguous problems (Can reasoning systems scale faster by exploring parallel paths instead?). The gains from that stochastic recursion don't come from sprinkling in randomness: ablations show naive noise yields no improvement, while a principled variational training objective that couples the latent state to a generative goal is what actually drives it (Does adding randomness alone improve recursive reasoning models?). A complementary line argues the win is breadth of *strategy*: at large budgets, allocating test-time compute to diverse abstractions outperforms simply sampling more solutions, because it prevents the 'underthinking' failure mode of depth-only chains (Can abstractions guide exploration better than depth alone?).
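The width idea can be sketched on a toy problem. Everything here is an assumption for illustration: the two-basin objective `score`, the gradient-descent `refine` loop standing in for a recursive reasoning chain, and the use of a known score to pick the winner. Real systems usually have no ground-truth score at inference time. The sketch shows only the inference-time pattern of running several trajectories and keeping the best; it does not model the variational training objective that the ablations above credit for the gains.

```python
import random

def score(z):
    """Toy objective with two basins; lower is better."""
    return (z * z - 1.0) ** 2 + 0.3 * z

def grad(z):
    """Derivative of score with respect to z."""
    return 4.0 * z * (z * z - 1.0) + 0.3

def refine(z0, steps, lr=0.05):
    """One serial chain: repeatedly refine a single latent value."""
    z = z0
    for _ in range(steps):
        z -= lr * grad(z)
    return z

def best_of_width(width, steps, rng):
    """Run `width` independent chains and keep the lowest-scoring result."""
    starts = [rng.uniform(-2.0, 2.0) for _ in range(width)]
    finals = [refine(z0, steps) for z0 in starts]
    return min(finals, key=score), finals

rng = random.Random(0)  # fixed seed so the run is repeatable
best, finals = best_of_width(width=8, steps=50, rng=rng)
print(best, score(best))
```

A single chain ends in whichever basin its starting point belongs to, and extra steps cannot move it out. Sampling several starting points is what gives the wider search a chance to land in the better basin, and the chains are independent, so they can run in parallel rather than adding serial latency.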
The twist the reader may not expect: more reasoning steps don't reliably mean harder thinking. Trace length correlates with difficulty only in-distribution and decouples from it entirely out-of-distribution; it mainly tracks how close a problem sits to the training distribution, so longer chains often reflect recalled schemas rather than adaptive computation (Does longer reasoning actually mean harder problems?). That reframes the whole comparison: recursive depth wins on puzzles not because 'more steps = smarter,' but because re-running a small reasoning loop until it converges is a genuinely adaptive, compositional use of compute, whereas adding parameters or fixed depth mostly buys memorized pattern-matching.
Sources (9 notes)
A 7M-parameter two-layer network recursing on its latent reasoning state reached 45% on ARC-AGI-1, beating larger LLMs with 0.01% of their parameters. The gains come from recursion itself, not scale or hierarchical architecture.
Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.
Each recurrent layer converges to distinct fixed points forming stable cyclic trajectories. Looped models learn to mirror and repeat feedforward inference stages rather than discover new computation, and this behavior emerges naturally without explicit training.
LoopMDM matches same-size masked diffusion models with 3.3× fewer training FLOPs and exceeds deeper non-looped baselines on reasoning tasks. Reusing computation through selective early-middle layer loops proves more effective than adding depth at fixed parameter budgets.
Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.
GRAM demonstrates that recursive reasoning models should maintain and explore multiple latent trajectories in parallel, not only deepen single paths. Width-scaling avoids the serial latency penalty of depth while sampling the solution distribution more effectively on ambiguous problems.
GRAM's ablations show naive stochasticity added to existing models yields no improvement. Gains come specifically from amortized variational inference, which couples stochastic latents to a principled generative objective rather than injecting undirected noise.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.