How do looped transformer layers actually behave during inference?
When language models loop their layers to improve reasoning, do they discover new computations or repeat existing ones? Understanding the internal dynamics could explain why recurrent architectures outperform simple depth scaling.
Looping an LLM's layers in latent space improves reasoning, but it has been unclear how the internal dynamics differ from those of a standard feedforward model. This mechanistic analysis answers that question through the lens of stages of inference: the idea that LLM computation decomposes into distinct computational stages.
The core result is geometric. For many looped models, each layer in the cycle converges to a distinct fixed point, so the recurrent block follows a consistent cyclic trajectory in latent space. As those fixed points are reached, attention-head behavior stabilizes and stays constant across recurrences. Empirically, the recurrent blocks learn stages of inference that closely mirror those of feedforward models, and they repeat those stages in depth with each iteration. This appears to be emergent: it shows up even when training does not explicitly encourage it. Where the iteration converges, repeated application of a shared block admits only two regimes: either the block's contribution vanishes asymptotically, or it traces a constant cyclic trajectory.
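The cyclic fixed-point picture can be illustrated with a toy model. The sketch below is not the analyzed architecture: it assumes a hypothetical block of three non-residual `tanh` layers whose weight matrices are rescaled to spectral norm 0.6, so that every layer is a contraction and convergence is guaranteed by construction. All names (`make_layer`, `run_looped_block`) and sizes are illustrative assumptions.

```python
import numpy as np

def make_layer(rng, dim, spectral_norm=0.6):
    """Random layer weights; W is rescaled to the given spectral norm (assumption)."""
    w = rng.standard_normal((dim, dim))
    w *= spectral_norm / np.linalg.norm(w, 2)  # ord=2 is the largest singular value
    u = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # input-injection weights
    return w, u

def run_looped_block(layers, x, n_loops):
    """Apply the shared block n_loops times.

    Returns an array of shape (n_loops, n_layers, dim) holding the hidden
    state after every layer in every loop.
    """
    h = np.zeros_like(x)
    history = []
    for _ in range(n_loops):
        per_layer = []
        for w, u in layers:
            # Input injection: the original input x re-enters at every layer.
            h = np.tanh(w @ h + u @ x)
            per_layer.append(h.copy())
        history.append(per_layer)
    return np.array(history)

rng = np.random.default_rng(0)
dim, n_layers, n_loops = 8, 3, 40
layers = [make_layer(rng, dim) for _ in range(n_layers)]
x = rng.standard_normal(dim)

states = run_looped_block(layers, x, n_loops)

# How far each layer's output moves between consecutive loops: (n_loops-1, n_layers).
drift = np.linalg.norm(states[1:] - states[:-1], axis=-1)

# Pairwise distances between the layers' converged outputs in the final loop.
final = states[-1]
gaps = [
    np.linalg.norm(final[i] - final[j])
    for i in range(n_layers)
    for j in range(i + 1, n_layers)
]

print("drift per layer, first loop pair:", drift[0])
print("drift per layer, last loop pair: ", drift[-1])
print("smallest gap between layer fixed points:", min(gaps))
```

In this toy setting the drift shrinks toward zero while the gaps between layers stay large: each layer position settles on its own fixed point, and one pass through the block visits them in the same order every time. Whether a trained looped transformer behaves this way is an empirical question that the toy does not settle.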
The implication that matters: recurrent depth is learned re-application of computation, not the discovery of genuinely new computation per loop. The loop re-runs the same inferential stages rather than adding qualitatively different ones. This is the mechanistic complement to the note "Does looping layers beat adding depth in diffusion models?": it explains why reused computation can match or beat added depth. The network was re-enacting stages anyway, and looping makes that reuse explicit at no extra parameter cost. Recurrent block size, input injection, and normalization govern whether these cyclic fixed points emerge and stay stable.
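The parameter side of that argument is simple arithmetic. The sketch below uses the common rough estimate of about 12·d² weights per transformer layer (four attention projections plus a two-matrix MLP with 4x expansion, ignoring biases, norms, and embeddings). The function names and the example sizes are hypothetical and are not taken from any model discussed here.

```python
def layer_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Rough weight count for one transformer layer (assumed estimate)."""
    attn = 4 * d_model * d_model             # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up and down projections
    return attn + mlp

def compare(d_model: int, block_layers: int, n_loops: int) -> dict:
    """Compare a looped block with a feedforward stack of equal effective depth."""
    per_layer = layer_params(d_model)
    effective_depth = block_layers * n_loops
    return {
        "effective_depth": effective_depth,
        # The looped model stores the block once and reuses it on every loop.
        "looped_params": block_layers * per_layer,
        # The feedforward model needs separate weights for every layer.
        "feedforward_params": effective_depth * per_layer,
    }

stats = compare(d_model=1024, block_layers=4, n_loops=8)
print(stats)
```

Under these assumptions the feedforward stack needs `n_loops` times as many layer weights to reach the same effective depth, while compute per token is the same in both cases.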
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries.
- Why does looping computation outperform adding more transformer layers?
- Can recurrent transformers learn genuinely new computations beyond inference stages?
- Why does reapplying the same transformer block work better than computing new layers?
- Can looping enable reasoning capabilities that fixed-depth transformers fundamentally cannot achieve?
- How does selective looping in diffusion models differ from recurrence in autoregressive architectures?
- What computational stages does a looped block re-enact across multiple iterations?
Related concepts in this collection (3)
- Does looping layers beat adding depth in diffusion models?
  When scaling masked diffusion language models with fixed parameters, is reusing computation through selective layer looping more efficient than simply making the network deeper? This matters because it challenges conventional scaling assumptions.
  Relation to this note: the empirical payoff this analysis explains.
- Can looped transformers generalize to unseen knowledge combinations?
  Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.
  Relation to this note: both probe what recurrent depth actually computes.
- Can recurrent hierarchies achieve reasoning that transformers cannot?
  Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
  Relation to this note: adjacent account of effective depth through recurrence.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- A Mechanistic Analysis of Looped Reasoning Language Models
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Generative Recursive Reasoning
- Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Original note title
looped transformers re-enact feedforward stages of inference in depth converging to cyclic fixed points — recurrent depth is learned re-application not new computation