Can recursion alone drive generalization better than model scale?
This explores whether re-applying a model's computation in a loop — recursion — can produce better generalization than simply making the model bigger, and what the corpus says about where each strategy's power actually comes from.
The question is whether recursion can beat raw scale at the thing we actually care about, generalizing to problems the model hasn't seen, and the corpus offers an unusually clean answer: on the right kind of task, yes, and by margins that make scale look almost beside the point. The sharpest data point is a 7M-parameter, two-layer network that recurses on its own latent reasoning state and reaches 45% on ARC-AGI-1, outperforming language models with roughly ten thousand times more parameters (see: Can tiny recursive networks outperform massive language models?). The authors isolate the cause: the gain comes from recursion itself, not from scale and not even from hierarchical structure. Looped architectures tell the same story from the other direction: re-applying the same layers in recurrent depth beats larger feedforward networks on reasoning, because iterating lets the model track state and compose steps in ways that just adding parameters cannot (see: Can models learn by looping instead of growing larger?).
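The core mechanism, re-applying one small block to a latent state, can be sketched in a few lines. This is a toy illustration, not the cited network: the names `make_block`, `apply_block`, and `recurse`, the tanh update, and the dimensions are all assumptions chosen for brevity. What it shows is that the number of loop steps changes how much computation is done but not how many parameters exist.

```python
import math
import random

def make_block(dim, seed=0):
    """Create one small weight-tied block: two dim x dim matrices (hypothetical layout)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)

    def matrix():
        return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]

    return {"W_state": matrix(), "W_input": matrix()}

def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def apply_block(block, x, z):
    """One update of the latent state z, conditioned on the fixed input x."""
    from_state = matvec(block["W_state"], z)
    from_input = matvec(block["W_input"], x)
    return [math.tanh(s + i) for s, i in zip(from_state, from_input)]

def recurse(block, x, steps):
    """Apply the same block `steps` times, starting from a zero latent state."""
    z = [0.0] * len(x)
    for _ in range(steps):
        z = apply_block(block, x, z)
    return z

def param_count(block):
    """Count weights in the block; independent of how often it is applied."""
    return sum(len(row) for m in block.values() for row in m)

block = make_block(dim=4)
x = [1.0, -0.5, 0.25, 0.0]
shallow = recurse(block, x, steps=1)
deep = recurse(block, x, steps=16)
print(param_count(block))  # 32, whatever the step count
```

Here `deep` has passed through sixteen sequential transformations and `shallow` through one, yet both were produced by the same 32 weights.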
The deeper reason is that depth and width are not interchangeable, even though scaling laws often treat parameters as fungible. MobileLLM found that for sub-billion-parameter models, going deep-and-thin beats going wide, composing abstract concepts through successive layers rather than spreading capacity sideways (see: Does depth matter more than width for tiny language models?). Recursion is depth taken to its limit, the same transformation applied over and over, so it inherits exactly the property that makes depth valuable: it builds compositional structure across iterations instead of memorizing more patterns in parallel. That matters because the failure being escaped is specific. Transformers often succeed at compositional tasks by quietly reducing them to matching memorized computation subgraphs, which works in-distribution and collapses on genuinely novel combinations (see: Do transformers actually learn systematic compositional reasoning?). Scale makes that memorization more extensive; it doesn't change its nature. And on truly iterative work, such as running a numerical method step by step, LLMs don't actually iterate at all; they recognize a problem as template-similar and emit plausible-but-wrong answers, a failure that persists across model scale (see: Do large language models actually perform iterative optimization?).
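To see what treating parameters as fungible hides, consider two hypothetical shapes with the same budget. The sketch below uses the common rough estimate of 12·d² weights per standard transformer block (4·d² for the attention projections plus 8·d² for an MLP with 4x expansion), ignoring embeddings, biases, and norms. The specific widths and depths are illustrative assumptions, not MobileLLM's actual configurations.

```python
def block_params(d_model):
    """Rough weight count for one standard transformer block (approximation)."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)   # up- and down-projection with 4x expansion
    return attention + mlp

def model_params(d_model, n_layers):
    """Total block weights for a stack of identical layers."""
    return n_layers * block_params(d_model)

deep_thin = model_params(d_model=512, n_layers=32)
wide_shallow = model_params(d_model=1024, n_layers=8)
print(deep_thin, wide_shallow)  # 100663296 100663296
```

Both shapes cost the same under a parameter-count view, yet one composes 32 sequential transformations and the other 8, which is the distinction the paragraph above turns on.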
But the corpus also refuses to let recursion become a magic word, and this is the part worth slowing down for: recursion only generalizes when it's coupled to the right learning objective. GRAM's ablations show that bolting naive stochasticity onto a recursive model yields nothing; the gains appear only when the recurrence is tied to a principled variational objective, not when noise is sprinkled on top (see: Does adding randomness alone improve recursive reasoning models?). In other words, 'recurse more' is not the lever; 'recurse with a training signal that makes each iteration mean something' is. The same lesson echoes in how small models close the gap with large ones elsewhere: direct preference optimization (DPO) on teacher-generated right-and-wrong pairs lets small models match big ones on function calling, not by adding capacity but by giving the training signal sharper structure (see: Can small models match large models on function calling?).
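For reference, the standard DPO loss for a single preference pair is short enough to write out. The sketch below assumes summed log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model. The function name, argument names, and `beta=0.1` are illustrative assumptions, not settings taken from the cited work.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair of log-probs."""
    chosen_shift = policy_chosen - ref_chosen        # how far the policy moved toward the good answer
    rejected_shift = policy_rejected - ref_rejected  # how far it moved toward the bad answer
    margin = beta * (chosen_shift - rejected_shift)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    x = -margin
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))

# Policy identical to the reference: no preference learned yet.
untrained = dpo_loss(-5.0, -6.0, -5.0, -6.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-3.0, -8.0, -5.0, -6.0)
```

The rejected response enters the loss directly, which is the 'sharper structure' point: the signal says what to move away from, not only what to imitate.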
There's also a ceiling worth naming. Recursion lets a model keep thinking, but more thinking is not unbounded self-improvement: a model's ability to fix itself is formally limited by the gap between generating an answer and verifying it, and no amount of internal looping escapes that without an external check (see: What stops large language models from improving themselves?). So recursion buys you compositional depth and state-tracking that scale can't, but it doesn't buy you a way out of needing ground truth.
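The role of the external check can be made concrete with a toy loop. In this sketch, `propose_and_verify`, the candidate stream, and the factoring task are all hypothetical stand-ins: the candidate stream plays the model's internal looping, and `verify` plays the ground-truth check that the loop cannot supply for itself.

```python
from typing import Callable, Iterable, Optional

def propose_and_verify(
    candidates: Iterable[int],
    verify: Callable[[int], bool],
    max_steps: int,
) -> Optional[int]:
    """Return the first candidate the external verifier accepts, or None."""
    for step, candidate in enumerate(candidates):
        if step >= max_steps:
            break  # thinking budget exhausted without a verified answer
        if verify(candidate):
            return candidate
    return None

n = 91

def is_factor(c: int) -> bool:
    """External check: cheap and exact, independent of how c was proposed."""
    return n % c == 0

found = propose_and_verify(range(2, n), is_factor, max_steps=50)
gave_up = propose_and_verify(range(2, n), is_factor, max_steps=3)
print(found, gave_up)  # 7 None
```

More steps help only because each one is checked against something outside the proposer; without `verify`, the loop has no way to tell 7 from 4.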
The quietly subversive takeaway: the field's reflex, that generalization is a resource you purchase with parameters, is at least partly an artifact of which architectures we measured. Once you let a model reuse its computation, a network small enough to ignore can out-generalize ones built to dominate, and the interesting frontier shifts from 'how big' to 'how many times, toward what objective.' If you want to pull this thread further, the work on whether neural nets compose at all without explicit symbolic structure is the natural next stop (see: Can neural networks actually achieve compositional generalization?).
Sources (9 notes)
A 7M-parameter two-layer network recursing on its latent reasoning state reached 45% on ARC-AGI-1, beating much larger LLMs while using about 0.01% of their parameters. The gains come from recursion itself, not scale or hierarchical architecture.
Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.
MobileLLM reports 2.7% and 4.3% accuracy gains over prior state-of-the-art models at the 125M and 350M scales respectively, using deep-and-thin architectures that compose abstract concepts through layers rather than spreading parameters across width.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
GRAM's ablations show naive stochasticity added to existing models yields no improvement. Gains come specifically from amortized variational inference, which couples stochastic latents to a principled generative objective rather than injecting undirected noise.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.