INQUIRING LINE


Can recurrent blocks learn genuinely novel computation beyond repetition?

This line asks whether looping a network's layers (re-running the same block repeatedly) can do more than repeat one step, and whether it can build up kinds of computation a fixed-depth network cannot reach. The corpus says yes, and the reason is more interesting than "more passes equals more work": recurrence changes what kind of problem the model can solve, not just how long it spends on it. Re-applying layers in depth lets a model track state and compose pieces of a solution, and on reasoning tasks this beats simply making a feedforward network bigger (see "Can models learn by looping instead of growing larger?"). The sharpest evidence is a 27M-parameter hierarchical model, trained on only 1,000 samples, that couples slow planning with fast detailed steps across two timescales and reaches near-perfect scores on Sudoku and mazes, problems where chain-of-thought methods fail completely (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). The framing there is precise: fixed-depth transformers sit under a complexity ceiling (the AC0/TC0 circuit classes), and recurrence is how a model climbs out from under it. So the loop is not repeating a computation; it is reaching a class of computation that, on this account, the flat network cannot express.
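The depth argument can be made concrete with a toy that is not a neural network at all. The sketch below assumes a hand-written update rule (`propagate`, a hypothetical name) standing in for the shared block: each application extends reachability in a maze by one hop. Capping the number of applications plays the role of a fixed-depth network, and looping until the state stops changing plays the role of recurrence with a convergence-based halting signal. It is an illustration of the idea only, not the method of any source note.

```python
def propagate(reached, open_cells):
    """One application of the shared block: extend reachability by one hop."""
    new = set(reached)
    for (r, c) in reached:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if nxt in open_cells:
                new.add(nxt)
    return new


def run_looped(start, open_cells, max_steps=None):
    """Re-apply the same block until the state converges or the depth cap is hit.

    max_steps=None means "loop until nothing changes" (recurrent depth);
    an integer cap imitates a fixed-depth stack of identical layers.
    """
    reached = {start}
    steps = 0
    while max_steps is None or steps < max_steps:
        nxt = propagate(reached, open_cells)
        if nxt == reached:  # convergence acts as the natural halting signal
            break
        reached = nxt
        steps += 1
    return reached, steps


# A 1 x 10 corridor: the goal is 9 hops away from the start.
corridor = {(0, c) for c in range(10)}
goal = (0, 9)

shallow, _ = run_looped((0, 0), corridor, max_steps=4)  # fixed depth of 4
deep, used = run_looped((0, 0), corridor)               # loop to convergence

print(goal in shallow)  # the capped version never reaches the goal
print(goal in deep, used)
```

The capped run fails for any maze whose shortest path exceeds the cap, however the rule is written, while the looped run adapts its depth to the instance. That is the sense in which the loop changes the class of problem rather than the time spent.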

Why does looping buy new capability rather than diminishing returns? Because depth is where compositional structure lives. Neural networks already tend to split tasks into modular subroutines that live in separate subnetworks (see "Do neural networks naturally learn modular compositional structure?"), and iterated depth gives those modules room to chain. The contrast case is illuminating: when transformers *look* like they reason compositionally, they are often just memorizing computation subgraphs from training, and they break down on novel combinations, with errors compounding across reasoning steps (see "Do transformers actually learn systematic compositional reasoning?"). That is exactly the "repetition, not novelty" failure the question worries about. It is a property of fixed-depth pattern-matching, which recurrence is meant to escape.
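The difference between memorized subgraphs and chained modules can be shown with a deliberately tiny example. Everything here is hypothetical and chosen for illustration: `MODULES` holds three hand-written functions, `memorizer` is a lookup table over the compositions seen in "training", and `compose` chains modules one step at a time. Neither is a model of how a transformer works internally.

```python
# Three hand-written "subroutines" standing in for learned modules.
MODULES = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}


def compose(names, x):
    """Chain modules step by step; works for any ordering of known modules."""
    for name in names:
        x = MODULES[name](x)
    return x


# "Training data": only two specific compositions on one input were ever seen.
train = [(("inc", "dbl"), 3), (("dbl", "neg"), 3)]
memo = {(names, x): compose(names, x) for names, x in train}


def memorizer(names, x):
    """Pattern-matcher: returns a stored answer, or None for an unseen combination."""
    return memo.get((names, x))


print(memorizer(("inc", "dbl"), 3))  # seen in training
print(memorizer(("neg", "inc"), 3))  # novel combination of familiar modules
print(compose(("neg", "inc"), 3))    # chaining handles it
```

The memorizer and the chained version agree on everything in the training set, so in-distribution accuracy cannot tell them apart. Only a novel combination of familiar pieces separates them.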

There is a deeper theoretical seam here worth pulling. There exists a single finite-size transformer that can compute any computable function given the right prompt, so the machinery for arbitrary computation can live in a fixed set of weights (see "Can a single transformer become universally programmable through prompts?"). The catch is that standard training rarely *teaches* a model to use that machinery. So "can recurrent blocks learn novel computation" splits into two questions: is the capability expressible (yes, in principle), and does the training regime actually instill it (usually the bottleneck)? The corpus echoes this elsewhere: reasoning models beat non-reasoning ones at any inference budget because training installed a protocol that makes the extra compute productive, not because they have more raw capability (see "Can non-reasoning models catch up with more compute?"). Looping gives you the compute; the training has to give you something worth doing with each loop.

The most surprising turn is that recurrence can be repurposed for things that are not prediction at all. One line of work uses recurrent passes with *no input tokens* to consolidate recent context into persistent fast weights, a sleep-like replay that separates memory consolidation from next-token prediction entirely (see "Can recurrence consolidate memory without predicting tokens?"). Relatedly, a feedback loop that lets a transformer attend to its own latent states fosters emergent working memory for indefinitely long inputs, with no extra weights (see "Can models learn working memory by attending to their own latents?"). These are not "do the same step again". They are the loop becoming a substrate for a different function: holding state, replaying, consolidating. That is the strongest sense in which recurrent blocks learn novel computation beyond repetition: the iteration becomes the place where memory and planning happen, not just where a feedforward pass gets re-run.
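A minimal sketch of the consolidation idea, under loud assumptions: the source note describes *learned* local rules, whereas the code below swaps in a fixed Hebbian outer-product rule, and the names `consolidate`, `recall`, and `buffer` are hypothetical. The point it illustrates is only the separation of phases: a replay pass that consumes no new input writes recent key-value pairs into a fast-weight matrix, after which the buffer can be discarded and the content is still retrievable.

```python
def consolidate(buffer, dim_v, dim_k, lr=1.0):
    """Replay pass with no new input tokens: write buffered pairs into fast weights.

    Uses a fixed Hebbian rule, W += lr * outer(value, key), as a stand-in
    for the learned local update described in the source note.
    """
    W = [[0.0] * dim_k for _ in range(dim_v)]
    for key, value in buffer:
        for i in range(dim_v):
            for j in range(dim_k):
                W[i][j] += lr * value[i] * key[j]
    return W


def recall(W, key):
    """Read from fast weights: a matrix-vector product W @ key."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]


# Recent context held as (key, value) pairs; keys are orthonormal so that
# retrieval is exact in this toy setting.
buffer = [
    ([1.0, 0.0], [0.5, -1.0, 2.0]),
    ([0.0, 1.0], [3.0, 0.0, 1.0]),
]

W = consolidate(buffer, dim_v=3, dim_k=2)
buffer.clear()  # the short-term store is gone; only the fast weights remain

print(recall(W, [1.0, 0.0]))
print(recall(W, [0.0, 1.0]))
```

With non-orthogonal keys the stored values would interfere, which is one reason a learned rule is more interesting than this fixed one. The sketch also says nothing about scheduling, which is where the separation of consolidation from prediction pays off in the source note's account.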

Sources 8 notes

Can models learn by looping instead of growing larger?

Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can recurrence consolidate memory without predicting tokens?

Language models can use recurrent passes without input tokens to transfer recent context into persistent fast weights via learned local rules, mirroring hippocampal replay during biological sleep. This separates consolidation from prediction, enabling different scheduling and compute allocation.

Can models learn working memory by attending to their own latents?

TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.
