INQUIRING LINE

How does dynamic recurrence during training improve depth extrapolation?

This reads as a question about how letting a model loop its own computation (recurrence) — rather than stacking a fixed number of layers — lets it reason at greater effective depth than its trained size would suggest, and the corpus circles this idea from several angles even though no single note nails the exact phrase.


Looping a model's computation during training can let it reason "deeper" at test time than a fixed-layer transformer of the same size, and the corpus approaches that idea from a few different directions rather than with one tidy answer. The cleanest anchor is the Hierarchical Reasoning Model note, "Can recurrent hierarchies achieve reasoning that transformers cannot?", which couples a slow planning loop with a fast computation loop across two timescales. Because the computation recurs instead of running once through fixed layers, the model reaches an *effective* depth far beyond what its 27M parameters would suggest, enough to solve Sudoku and mazes on which chain-of-thought methods fail completely. The key claim there is that fixed-depth transformers sit under a complexity ceiling (the AC0/TC0 limit) and that recurrence is the lever that escapes it. That's the heart of "depth extrapolation": the trained network is small, but the unrolled computation can be made much deeper simply by running more iterations.
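To make "effective depth" concrete, here is a minimal sketch of a weight-tied, two-timescale loop. It is an illustration under stated assumptions, not the Hierarchical Reasoning Model's actual architecture: the names `two_timescale_forward`, `W_fast`, and `W_slow`, the tanh update, and the toy dimensions are all hypothetical. The point it shows is only that the parameter count stays fixed while the number of block applications grows with the unroll.

```python
import math
import random

def make_matrix(rows, cols, rng, scale=0.3):
    """Small random weight matrix (hypothetical initialisation)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def step(W, x):
    """One application of a shared block: tanh(W @ x)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def two_timescale_forward(x, W_fast, W_slow, slow_steps, fast_steps):
    """Unroll a slow loop around a fast loop, reusing the same two matrices."""
    slow = [0.0] * len(x)
    fast = list(x)
    applications = 0
    for _ in range(slow_steps):
        for _ in range(fast_steps):
            # The fast state is updated from itself and the current slow state.
            fast = step(W_fast, fast + slow)
            applications += 1
        # The slow state is updated once per inner loop, from the fast result.
        slow = step(W_slow, slow + fast)
        applications += 1
    return slow, applications

def param_count(*mats):
    return sum(len(m) * len(m[0]) for m in mats)

rng = random.Random(0)
dim = 4
W_fast = make_matrix(dim, 2 * dim, rng)
W_slow = make_matrix(dim, 2 * dim, rng)
x = [1.0, -0.5, 0.25, 0.0]

_, shallow = two_timescale_forward(x, W_fast, W_slow, slow_steps=2, fast_steps=3)
_, deep = two_timescale_forward(x, W_fast, W_slow, slow_steps=8, fast_steps=16)
print(param_count(W_fast, W_slow), shallow, deep)  # 64 8 136
```

The same 64 weights produce an 8-step or a 136-step computation depending only on the loop counts, which is the sense in which depth becomes a run-time setting rather than an architectural constant.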

Why depth itself matters, and why getting more of it is worth the trouble, shows up vividly in "Does network depth unlock qualitatively new behaviors in RL?". Scaling self-supervised RL networks to extreme depth (up to 1,000 layers) doesn't yield gradual gains; it produces sudden jumps at specific thresholds (depth 16 unlocks walking, depth 256 unlocks wall-climbing). Capability appears to be gated behind reachable depth. Recurrence is attractive precisely because it offers a cheaper route to that depth: instead of physically stacking a thousand layers, you reuse a smaller block many times.

The corpus also offers a useful counter-current: depth isn't the only axis worth scaling. "Can reasoning systems scale wider instead of only deeper?" argues that sampling parallel latent trajectories (width) sidesteps the serial latency cost of going deeper, capturing comparable benefits without the wait. And "Can abstractions guide exploration better than depth alone?" shows that depth-only reasoning chains hit an "underthinking" failure mode that breadth-first abstraction avoids. Read against the recurrence papers, these suggest the real win isn't depth for its own sake but a controllable compute budget, and recurrence is one way to make depth a dial you can turn at inference rather than a fixed architectural choice.
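The depth-versus-width trade can be written down as simple bookkeeping. The sketch below is a toy accounting model, not anything taken from the GRAM or RLAD notes: the `Budget` class and the assumption that parallel trajectories run with perfect parallelism are both hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    depth: int  # sequential recurrent steps per trajectory
    width: int  # independent trajectories sampled in parallel

    def total_steps(self) -> int:
        """Total block applications across all trajectories."""
        return self.depth * self.width

    def serial_latency(self) -> int:
        """Steps on the critical path, assuming trajectories run fully in parallel."""
        return self.depth

deep = Budget(depth=64, width=1)
wide = Budget(depth=8, width=8)

# Same total compute, very different wait time.
print(deep.total_steps(), wide.total_steps())        # 64 64
print(deep.serial_latency(), wide.serial_latency())  # 64 8
```

Whether the wide allocation actually matches the deep one in quality is the empirical question the notes address; the arithmetic only shows why width is cheaper in latency for the same spend.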

There's a quieter, more surprising thread too: recurrence doesn't have to be in service of predicting the next token at all. "Can recurrence consolidate memory without predicting tokens?" describes recurrent passes that run *without input tokens*, transferring recent context into persistent fast weights, much like hippocampal replay during sleep. This reframes "dynamic recurrence during training" as something broader than deeper forward passes: recurrence can be scheduled, allocated, and repurposed for consolidation, which is itself a way of building reusable depth that the model carries forward.
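As a rough picture of consolidation without token input, the sketch below replays stored context states and folds them into a fast-weight matrix. The note describes *learned* local rules; the fixed Hebbian outer-product update used here is a swapped-in stand-in, and `consolidate`, `recall`, `lr`, and `decay` are hypothetical names and parameters chosen for illustration.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]

def consolidate(fast_weights, context_states, passes, lr=0.1, decay=0.9):
    """Replay stored states (no new input tokens) and write them into fast weights.

    Assumption: a fixed Hebbian rule F <- decay * F + lr * (h outer h),
    standing in for the learned local rule described in the note.
    """
    F = [row[:] for row in fast_weights]
    for _ in range(passes):
        for h in context_states:
            delta = outer(h, h)
            F = [[decay * f + lr * d for f, d in zip(f_row, d_row)]
                 for f_row, d_row in zip(F, delta)]
    return F

def recall(F, cue):
    """Read from fast weights with a cue vector: F @ cue."""
    return [sum(f * c for f, c in zip(row, cue)) for row in F]

empty = [[0.0, 0.0], [0.0, 0.0]]
recent_context = [[1.0, 0.0]]  # one remembered hidden state

F = consolidate(empty, recent_context, passes=3)
print(recall(empty, [1.0, 0.0]))  # [0.0, 0.0]
print(recall(F, [1.0, 0.0])[0] > 0.0)  # True
```

No token is consumed or predicted inside `consolidate`; the recurrent passes only move information from the replay buffer into weights that persist afterwards.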

One honest caveat: none of these notes runs the specific experiment of varying recurrence depth at train time and measuring extrapolation to unseen depths at test time, so the literal mechanism in your question is assembled here from adjacent evidence rather than quoted from one source. If you want the single most direct doorway, start with the Hierarchical Reasoning Model note, "Can recurrent hierarchies achieve reasoning that transformers cannot?". It's the note where recurrence and escaping fixed-depth limits meet most explicitly.
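For readers who want to run that missing experiment, its shape might look like the harness below. Everything here is hypothetical scaffolding: `sample_train_depth`, `depth_sweep`, and the depth ranges are illustrative, and `placeholder_error` is a Newton iteration that stands in for a trained recurrent model only so the harness runs. It says nothing about whether a real model extrapolates.

```python
import random

def sample_train_depth(rng, min_depth, max_depth):
    """Draw how many recurrent steps to unroll for one training batch."""
    return rng.randint(min_depth, max_depth)

def depth_sweep(evaluate, train_max_depth, test_depths):
    """Evaluate at each depth, split by whether it exceeds the training range."""
    within, beyond = {}, {}
    for depth in test_depths:
        bucket = within if depth <= train_max_depth else beyond
        bucket[depth] = evaluate(depth)
    return within, beyond

def placeholder_error(depth, target=2.0):
    """Stand-in for a model: error of Newton's sqrt iteration after `depth` steps."""
    guess = 1.0
    for _ in range(depth):
        guess = 0.5 * (guess + target / guess)
    return abs(guess * guess - target)

rng = random.Random(0)
# Training would unroll a different number of steps per batch, never more than 4.
train_depths = [sample_train_depth(rng, 1, 4) for _ in range(1000)]

within, beyond = depth_sweep(
    placeholder_error, train_max_depth=4, test_depths=[1, 2, 4, 8, 16]
)
print(sorted(within), sorted(beyond))  # [1, 2, 4] [8, 16]
```

In a real study, `evaluate` would run the trained model with the given unroll count on held-out problems, and the comparison of interest is how the `beyond` scores trend relative to the `within` ones.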


Sources (5 notes)

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Does network depth unlock qualitatively new behaviors in RL?

Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can recurrence consolidate memory without predicting tokens?

Language models can use recurrent passes without input tokens to transfer recent context into persistent fast weights via learned local rules, mirroring hippocampal replay during biological sleep. This separates consolidation from prediction, enabling different scheduling and compute allocation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a synthesis researcher tasked with re-testing claims about dynamic recurrence and depth extrapolation in language models. The question remains: does recurrence during training enable models to reason at depths unseen during pre-training?

What a curated library found — and when (dated claims, not current truth):
Findings span Oct 2024 – May 2026. Key claims:
- A 27M-parameter model with coupled slow/fast recurrent loops solves Sudoku and mazes that chain-of-thought fails on, reaching *effective* depth far beyond its parameter count (Hierarchical Reasoning Model, ~2025).
- Fixed-depth transformers hit AC0/TC0 complexity ceilings; recurrence is claimed as the escape lever (~2025).
- Depth scaling in self-supervised RL shows qualitative jumps at thresholds (depth 16 → walking; depth 256 → wall-climbing), suggesting capability is gated behind reachable computational depth (~2025).
- Parallel trajectory sampling (width) can match depth benefits without serial latency; breadth-first abstraction avoids "underthinking" failure modes of pure-depth chains (~2025–2026).
- Recurrence can serve memory consolidation and context transfer without token prediction, reframing "dynamic recurrence" beyond forward passes (~2026).

Anchor papers (verify; mind their dates):
- arXiv:2506.21734 Hierarchical Reasoning Model (2025-06)
- arXiv:2503.14858 1000 Layer Networks for Self-Supervised RL (2025-03)
- arXiv:2605.26099 Language Models Need Sleep (2026-05)
- arXiv:2502.05171 Scaling up Test-Time Compute with Latent Reasoning (2025-02)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the AC0/TC0 ceiling claim, assess whether recent models with longer inference horizons, dynamic routing, or mixture-of-experts architectures have *relaxed* this limit without recurrence. Separately: does recurrence trained end-to-end actually extrapolate to *arbitrary* unseen depths, or only to depths within the training rollout window? Cite what holds and what has shifted.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months. In particular, search for papers claiming width or parallel reasoning achieves equivalent depth extrapolation *without* recurrence, or showing recurrence incurs hidden costs (e.g., gradient flow, training instability) that newer baselines avoid.
(3) Propose 2 research questions that assume the regime may have moved: (a) Does test-time adaptive recurrence depth (learned via reinforcement feedback) outperform fixed recurrence schedules? (b) Can recurrence and breadth-first abstraction be unified in a single learned allocation mechanism, rather than competing axes?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
