Why does recursion on latent states improve generalization more than scale?
This explores why letting a small network loop over its own internal reasoning state — re-running the same layers — produces better generalization than simply adding more parameters, and what the corpus says about depth/recursion as a scaling axis distinct from size.
The corpus gives a surprisingly consistent answer across very different architectures. The headline result is almost absurd: a 7-million-parameter, two-layer network that recurses on its own latent reasoning state reaches 45% on ARC-AGI-1, outscoring billion-parameter LLMs with 0.01% of their parameters [Can tiny recursive networks outperform massive language models?]. The authors are careful to attribute the gain to recursion itself, not to scale and not to hierarchical structure. That's the puzzle worth unpacking.
The mechanism shows up again in looped architectures more broadly: re-applying the same layers in recurrent depth lets a model track and update an evolving state, which is exactly what compositional generalization needs and what adding width alone can't buy [Can models learn by looping instead of growing larger?]. Width spreads parameters across more parallel features; depth-by-iteration lets a model compose abstract concepts step by step. That distinction is visible even at tiny scale, where deep-and-thin models beat balanced ones by 2.7–4.3% accuracy at 125M–350M parameters, which cuts against the naive 'bigger is better' reading of scaling laws [Does depth matter more than width for tiny language models?]. The world-models work makes the trade explicit: iterating computation in a shared block yields up to 100x parameter efficiency by spending extra depth on the harder prediction steps, the way a physical system settles into a solution [Can looped computation replace parameter count in world models?].
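To make the width-versus-iteration trade concrete, here is a minimal sketch of a weight-tied loop in plain Python. It is a toy, not the architecture from any of the cited notes: the names `make_block`, `apply_block`, `looped_forward`, `param_count`, and `stacked_param_count` are hypothetical, and the block is just one small matrix, a bias, and a tanh. The point it illustrates is only that loop count changes effective depth without changing parameter count.

```python
import math
import random


def make_block(dim, seed=0):
    """Build one hypothetical shared block: a dim x dim weight matrix and a bias."""
    rng = random.Random(seed)
    scale = 0.5 / math.sqrt(dim)  # small weights keep the toy iteration well behaved
    weights = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias


def apply_block(block, state, inputs):
    """One pass of the shared block: mix the latent state, re-inject the input, squash."""
    weights, bias = block
    new_state = []
    for row, b, x in zip(weights, bias, inputs):
        pre = sum(w * s for w, s in zip(row, state)) + b + x
        new_state.append(math.tanh(pre))
    return new_state


def looped_forward(block, inputs, n_loops):
    """Re-apply the SAME block n_loops times to an evolving latent state."""
    state = [0.0] * len(inputs)
    for _ in range(n_loops):
        state = apply_block(block, state, inputs)
    return state


def param_count(block):
    """Parameters in the shared block, independent of how often it is looped."""
    weights, bias = block
    return sum(len(row) for row in weights) + len(bias)


def stacked_param_count(dim, n_layers):
    """Parameters an untied stack of n_layers same-shaped layers would need."""
    return n_layers * (dim * dim + dim)
```

With `dim=8`, sixteen loop iterations reuse the same 72 parameters, whereas sixteen untied layers of the same shape would hold 1,152. Whether the looped version generalizes better is the empirical claim made by the notes above; the sketch only shows where the parameter saving comes from.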
The deeper reason recursion works on *latent* states (not tokens) is about the medium being iterated over. A formal sample-complexity result shows that predicting your own latents recovers compositional structure with a number of samples that stays constant as hierarchy depth grows, while token-level learning needs exponentially more, because nearby latents are far more correlated than raw tokens [Why is predicting latents more sample-efficient than tokens?]. So latent space is a smoother, more structured place to do the looping, and each recursive pass refines something that already carries compositional signal. Latent-thought models lean on the same insight, treating latent size as a scaling dimension independent of parameter count [Can latent thought vectors scale language models beyond parameters?].
Why can't scale just do this? Because the thing recursion buys you, modular and composable subroutines applied to whatever depth the input demands, isn't reliably produced by piling on parameters. Networks do decompose tasks into isolated subnetworks [Do neural networks naturally learn modular compositional structure?], and modern nets genuinely compose despite having no explicit symbolic machinery [Can neural networks actually achieve compositional generalization?]. But huge LLMs still hit systematic linguistic blind spots that worsen predictably with structural depth, suggesting scale captures surface statistics rather than the deep recursive rules [Why do large language models fail at complex linguistic tasks?]. Recursion attacks that gap directly by making depth-of-reasoning a runtime variable instead of a fixed property baked into parameter count.
The thing you might not have expected: recursion isn't only an efficiency hack. It changes *how* a model spends effort on hard inputs. Looped world-models pour more iterations into harder steps [Can looped computation replace parameter count in world models?], and separately, LLMs sparsify their hidden states adaptively when tasks get unfamiliar, a built-in mechanism for reconfiguring computation under load rather than failing [Do language models sparsify their activations under difficult tasks?]. Both hint that generalization comes from a model that can *vary its depth of thought*, which is precisely what iterating on a latent state allows and what a fixed forward pass through a giant network does not.
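One way to picture "varying depth of thought" is a loop that halts when its latent state stops changing, so the number of iterations is decided per input at runtime. The sketch below is a deliberately simple scalar toy under loud assumptions: `loop_until_stable` is a hypothetical name, the update rule `state <- weight * state + x` is a stand-in for a learned block, and input magnitude stands in for "difficulty". None of this is taken from the cited notes beyond the general idea of convergence-based halting.

```python
def loop_until_stable(x, weight=0.5, tol=1e-3, max_steps=64):
    """Iterate state <- weight * state + x until the update is smaller than tol.

    Assumes |weight| < 1 so the iteration contracts toward x / (1 - weight).
    Returns (final_state, steps_used).
    """
    state = 0.0
    for step in range(1, max_steps + 1):
        new_state = weight * state + x
        # Convergence signal: halt once one more pass barely moves the state.
        if abs(new_state - state) < tol:
            return new_state, step
        state = new_state
    # Budget exhausted without meeting the tolerance.
    return state, max_steps


if __name__ == "__main__":
    for x in (0.01, 1.0, 100.0):
        final_state, steps = loop_until_stable(x)
        print(f"x={x}: halted after {steps} steps at state {final_state:.4f}")
```

In this toy, inputs that start further from their fixed point take more passes before the halting test fires, while a fixed-depth forward pass would spend the same compute on all of them. A real looped model would replace the scalar update with a learned block and would need its own stability guarantee.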
Sources (10 notes)
A 7M-parameter two-layer network recursing on its latent reasoning state reached 45% on ARC-AGI-1, beating larger LLMs with 0.01% of their parameters. The gains come from recursion itself, not scale or hierarchical architecture.
Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
LoopWM achieves up to 100x parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.