Can latent recurrence achieve the depth that standard transformers cannot?
This explores whether re-applying a model's layers over its own hidden state ('latent recurrence') can reach reasoning depths that fixed-depth transformers are mathematically barred from — and where that trick stops paying off.
The corpus gives a surprisingly strong yes to that question, with caveats. The cleanest result is the Hierarchical Reasoning Model (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"), which couples slow abstract planning with fast detailed computation across two timescales and nearly solves Sudoku and maze tasks on which chain-of-thought methods fail outright, all with 27M parameters and 1,000 training samples. The key claim isn't 'it's better,' it's *why*: because their depth is fixed, ordinary transformers are confined to low circuit-complexity classes (AC0/TC0), whereas recurrence lets effective depth grow with the problem instead of with the parameter count. The looping literature agrees from a different angle: models that re-apply their layers in recurrent depth outperform larger feedforward networks on reasoning, because recursion enables state tracking and compositional generalization that parameter scaling alone does not reach (see "Can models learn by looping instead of growing larger?").
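To make 'effective depth grows with the problem' concrete, here is a minimal toy sketch of state tracking. It is hand-written, not learned, and it is not the Hierarchical Reasoning Model; the names `step` and `track_state` and the choice of permutation composition as the task are assumptions made for illustration. The point is only that one shared update is applied once per input element, so the number of sequential steps scales with the input while the 'parameters' (the single `step` function) stay fixed.

```python
from typing import Sequence, Tuple

Perm = Tuple[int, ...]


def step(state: Perm, move: Perm) -> Perm:
    """One shared update: position i of the new state takes the item at position move[i]."""
    return tuple(state[i] for i in move)


def track_state(moves: Sequence[Perm], n: int) -> Perm:
    """Apply the same step once per move, so sequential depth equals len(moves)."""
    state: Perm = tuple(range(n))  # start from the identity arrangement
    for move in moves:
        state = step(state, move)  # same function reused at every step
    return state


# Hypothetical example moves over three items.
swap_first_two: Perm = (1, 0, 2)
rotate: Perm = (1, 2, 0)
final = track_state([swap_first_two, rotate], n=3)
print(final)
```

A fixed-depth network has to fit this whole chain into a constant number of layers regardless of how many moves arrive; the loop simply runs longer.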
What's interesting is *what the loop actually does*. One study finds looped transformers don't invent new computation when iterated: each layer in the loop converges to its own fixed point, so the hidden state settles into a stable cyclic trajectory, and the model essentially re-enacts the feedforward inference stages it would otherwise run once (see "How do looped language models actually improve reasoning in depth?"). So 'depth' here means giving the model more passes to settle into an answer, not new machinery. That reframes recurrence as a way to spend more inference compute on a problem that needs it, with the convergence signal doubling as a natural stopping rule.
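The 'iterate until it settles' idea can be sketched as a plain fixed-point iteration. This is an illustrative assumption, not the procedure from the cited study: the update `tanh(w @ h + x)`, the names `block` and `iterate_until_settled`, the tolerance, and the example weights are all hypothetical. The weights are chosen small enough that the update is a contraction, which is what guarantees convergence in this toy; a trained looped model offers no such guarantee by construction.

```python
import math
from typing import List, Tuple


def block(h: List[float], x: List[float], w: List[List[float]]) -> List[float]:
    """One weight-tied pass: h_new[i] = tanh(sum_j w[i][j] * h[j] + x[i])."""
    return [
        math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + xi)
        for row, xi in zip(w, x)
    ]


def iterate_until_settled(
    x: List[float],
    w: List[List[float]],
    tol: float = 1e-6,
    max_steps: int = 100,
) -> Tuple[List[float], int]:
    """Re-apply the same block until the state stops changing, then halt."""
    h = [0.0] * len(x)
    for step in range(1, max_steps + 1):
        h_new = block(h, x, w)
        delta = max(abs(a - b) for a, b in zip(h_new, h))
        h = h_new
        if delta < tol:  # convergence doubles as the stopping rule
            return h, step
    return h, max_steps  # budget exhausted without settling


# Hypothetical weights with small row sums, so each pass shrinks the change.
w = [[0.2, -0.1], [0.1, 0.3]]
x = [0.5, -0.25]
h_star, steps_used = iterate_until_settled(x, w)
print(steps_used, h_star)
```

The number of passes is decided at run time by `delta`, not fixed in advance, which is the sense in which compute is spent only where the problem needs it.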
The same 'attend to your own latents' idea pays off beyond reasoning puzzles. Adding a feedback loop so a transformer attends to its own latent representations gives it emergent working memory for indefinitely long inputs, with no extra weights, and it helps at 1B, 8B, and 24B scale (see "Can models learn working memory by attending to their own latents?"). And the depth advantage may start before inference: predicting your own latents rather than raw tokens recovers compositional hierarchies with a number of samples that stays *constant* in hierarchy depth, while token-level learning needs a number that grows exponentially (see "Why is predicting latents more sample-efficient than tokens?"). The latent space is where the structure lives, so working in it, whether by looping or by predicting it, is where the leverage is.
The honest counterweight: recurrence trades one limit for another. A fixed-size recurrent latent state is provably worse than attention at copying and retrieving from long context: two-layer transformers can copy exponentially long strings, which state-space models structurally can't (see "Can state-space models match transformers at copying and retrieval?"). So 'depth' and 'memory' pull in opposite directions: recurrence buys iterated reasoning depth but can bottleneck on faithful recall, which is why hybrid designs bolt separate long-term memory modules onto attention rather than asking one looped state to do everything (see "Can neural memory modules scale language models beyond attention limits?").
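The intuition behind the fixed-size-state limit can be sketched with a simple counting argument. This is a simplification, not the cited proof, and the function names are hypothetical: a state of b bits can take at most 2^b distinct values, so it cannot hold a distinct value for each of the V^n possible strings of length n over a vocabulary of size V unless 2^b >= V^n. The bound is necessary, not sufficient, and it says nothing about what a particular architecture actually learns to store.

```python
def min_state_bits(length: int, vocab_size: int) -> int:
    """Fewest bits that can give every string of this length a distinct state."""
    # Smallest b with 2**b >= vocab_size**length, computed with exact integers.
    return (vocab_size ** length - 1).bit_length()


def can_copy_exactly(state_bits: int, length: int, vocab_size: int) -> bool:
    """Pigeonhole check: are there at least as many states as possible strings?"""
    return 2 ** state_bits >= vocab_size ** length


# Hypothetical numbers: binary strings of length 10 need at least 10 bits.
print(min_state_bits(10, 2))
print(can_copy_exactly(8, 10, 2))
```

Attention sidesteps this particular bound because what it reads from, the context itself, grows with the input instead of being squeezed into a fixed-size state.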
The deeper takeaway you might not expect: depth alone doesn't guarantee *systematic* reasoning. Transformers often succeed by memorizing computation subgraphs from training and collapse on genuinely novel compositions (see "Do transformers actually learn systematic compositional reasoning?"). Latent recurrence raises the ceiling on how much computation a small model can perform per problem, and on the puzzle benchmarks that ceiling matters enormously. Whether the extra depth is spent on real algorithmic reasoning or just deeper pattern-matching is the open question the field is still wrestling with.
Sources (8 notes)
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.
Each recurrent layer converges to its own distinct fixed point, and together these form stable cyclic trajectories. Looped models learn to mirror and repeat feedforward inference stages rather than discover new computation, a behavior that emerges without being explicitly trained for.
TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.