Do looped transformers naturally converge to fixed points during inference?
This explores whether looped transformers (models that re-run the same layers repeatedly instead of stacking more) actually settle into a stable 'fixed point' as they iterate, and whether that settling is useful. The short version from the corpus: yes, they tend to converge, and the convergence turns out to be a feature, not a side effect. Looped models re-apply a shared block of layers in 'recurrent depth,' and once the latent state stops changing meaningfully from one pass to the next, that quiet point becomes a natural signal that the model is done thinking (see: Can models learn by looping instead of growing larger?).
The most direct evidence treats that convergence as a halting mechanism. Instead of training a model to emit a special 'stop' token, you can watch the latent state and stop when it reaches a fixed point. This calibrates compute closer to where accuracy actually saturates than learned halt tokens do, without any special training regime (see: Can fixed points replace learned halt tokens in reasoning models?). So convergence isn't just something that happens; it's something you can read off to decide how long to keep iterating. Harder inputs take more passes before they settle and easier ones settle fast, which is exactly the adaptive-compute behavior you'd want.
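To make the halting idea concrete, here is a minimal sketch of stopping at a fixed point. It is a toy, not the method from the cited note: the `block` function, the `tanh` update, the weights, and the tolerance `tol` are all assumptions chosen for illustration. The loop re-applies one shared block and halts once the largest change in the latent state falls below `tol`.

```python
import math

def block(h, x, W):
    """One pass of a toy shared block (hypothetical): h <- tanh(W h + x)."""
    return [math.tanh(sum(w * hj for w, hj in zip(row, h)) + xi)
            for row, xi in zip(W, x)]

def run_until_fixed_point(x, W, tol=1e-6, max_steps=200):
    """Re-apply `block` until the state stops changing.

    Returns (state, passes_used, converged). No halt token is involved:
    the stopping signal is the size of the latest state update.
    """
    h = [0.0] * len(x)
    for step in range(1, max_steps + 1):
        h_next = block(h, x, W)
        # Largest per-coordinate change between consecutive passes.
        delta = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if delta < tol:
            return h, step, True
    return h, max_steps, False

if __name__ == "__main__":
    # A strongly contractive loop versus a weakly contractive one.
    fast = run_until_fixed_point([0.5, -0.2], [[0.2, 0.1], [0.0, 0.3]])
    slow = run_until_fixed_point([0.1, 0.1], [[0.9, 0.0], [0.0, 0.9]])
    print("fast loop passes:", fast[1], "converged:", fast[2])
    print("slow loop passes:", slow[1], "converged:", slow[2])
```

In this toy, the weakly contractive loop needs more passes than the strongly contractive one before it settles. That is only a loose stand-in for "harder inputs take longer"; in a real model the number of passes depends on the learned block and the input, not on a hand-picked matrix.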
Why does looping buy anything in the first place? Because plain feedforward transformers have no native way to carry an evolving state. They have to push it deeper into more layers until they run out of depth, which is why they lean on chain-of-thought as an external scratchpad (see: Why do transformers need explicit chain-of-thought reasoning?). Recurrence gives the state somewhere to live and keep refining, which is what lets looped and recurrent-depth models track state and generalize compositionally in ways parameter scaling alone can't (see: Can looped transformers generalize to unseen knowledge combinations?). Convergence to a fixed point is the visible trace of that refinement reaching equilibrium.
There's an interesting wrinkle, though: convergence isn't automatically guaranteed; it can be engineered. World-model work that iterates a shared block to refine latent environment states adds spectral-norm constraints specifically to get formal stability guarantees, so the loop provably settles rather than drifting or oscillating (see: Can looped computation replace parameter count in world models?). That's worth sitting with: 'naturally converge' is partly true and partly a design choice, because a loop can in principle diverge, and stable convergence is something architects build in.
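One way to see why a spectral-norm constraint yields a stability guarantee: if the loop's weight matrix has its largest singular value below 1 and the nonlinearity is 1-Lipschitz (as `tanh` is), each pass pulls any two states closer together, so the iteration is a contraction and has a single fixed point. The sketch below illustrates that argument under those assumptions. It is not the cited work's implementation; `spectral_norm`, `constrain`, `loop_step`, and the cap `max_norm=0.9` are hypothetical names and values.

```python
import math

def matvec(W, v):
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def spectral_norm(W, iters=200):
    """Estimate the largest singular value by power iteration on W^T W.

    Caveat: power iteration approaches the true value from below, so an
    unconverged estimate can understate the norm.
    """
    Wt = [list(col) for col in zip(*W)]
    v = [1.0] * len(W[0])
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in u]
    return math.sqrt(sum(c * c for c in matvec(W, v)))

def constrain(W, max_norm=0.9):
    """Rescale W so its estimated spectral norm is at most max_norm."""
    sigma = spectral_norm(W)
    scale = min(1.0, max_norm / sigma) if sigma > 0 else 1.0
    return [[w * scale for w in row] for row in W]

def loop_step(h, x, W):
    """One pass of the toy loop: h <- tanh(W h + x)."""
    return [math.tanh(z + xi) for z, xi in zip(matvec(W, h), x)]

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

if __name__ == "__main__":
    W = [[2.0, 0.0], [0.0, 0.5]]      # unconstrained: norm above 1
    Wc = constrain(W, max_norm=0.9)   # constrained: norm capped at 0.9
    x = [0.3, -0.1]
    a, b = [1.0, -1.0], [-0.5, 0.7]   # two different starting states
    for _ in range(5):
        a, b = loop_step(a, x, Wc), loop_step(b, x, Wc)
        print("gap between trajectories:", dist(a, b))
```

With the constrained matrix, the gap between the two trajectories shrinks by at least a factor of `max_norm` per pass, which is the sense in which the loop "provably settles." Without the constraint there is no such bound, and the same loop may oscillate or settle somewhere that depends on where it started.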
The cross-domain framing the corpus surfaces is that depth-as-iteration is becoming its own scaling axis, and not all recurrence settles to a single fixed point. Hierarchical models run two coupled timescales (slow planning, fast computation) to reach effective depths that fixed-depth transformers can't, escaping the complexity ceiling that limits them (see: Can recurrent hierarchies achieve reasoning that transformers cannot?). So if you came for 'do loops converge,' the thing you didn't know you wanted to know is that the convergence point is doing double duty: it's both the answer and the off-switch.
Sources (6 notes)
Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.
FPRM shows that looped transformers halt more accurately by detecting when their latent state reaches a fixed point, calibrating compute closer to the accuracy-saturation point than learned halt tokens without requiring special training regimes.
Feedforward transformers lack native recurrent state-tracking and must push evolving state deeper into layers, eventually exhausting depth. Explicit chain-of-thought externalizes this state into tokens as a costly patch for a structural deficiency.
Recurrent-depth transformers with shared parameters across iterations enable systematic generalization and depth extrapolation that vanilla transformers cannot achieve. This emerges through a sharp three-phase process: memorization, then in-distribution generalization, then out-of-distribution generalization.
LoopWM achieves up to 100x parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.