INQUIRING LINE


Why does reused computation outperform adding new model depth?

This explores why looping or re-applying the same layers (reused computation) often beats simply stacking more layers to make a model deeper — and what recursion buys that raw depth doesn't.

The short version from the corpus: recursion lets a model *iterate* on its own intermediate state, and that iteration buys capabilities (state tracking, compositional reasoning, spending more effort on harder steps) that extra parameters alone don't deliver.

The most direct evidence comes from looped architectures. Models that re-apply the same block in a recurrent loop outperform larger feedforward networks on reasoning tasks, because recursion enables state tracking and compositional generalization that parameter scaling alone does not achieve (Can models learn by looping instead of growing larger?). In masked diffusion language models, selectively looping early-middle layers matches same-size baselines with 3.3× fewer training FLOPs and beats deeper non-looped models on reasoning at a fixed parameter budget (Can looping layers beat adding depth in diffusion models?). World models show the same pattern more dramatically: refining latent state through iterated computation in a shared block reaches up to 100× parameter efficiency, spending more depth only on the harder prediction steps (Can looped computation replace parameter count in world models?). The unifying idea: depth used as *iteration* adapts to problem difficulty in a way that a fixed stack of unique layers can't.
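To make the parameter accounting behind "looping instead of stacking" concrete, here is a minimal sketch in plain Python. It is not taken from any of the cited models: the toy `tanh(W h + b)` block, the helper names (`make_block`, `apply_block`, `stacked_forward`, `looped_forward`), and the sizes are all assumptions chosen for illustration. It shows only that a looped model performs the same number of block applications as a stacked one while storing a single block's weights; it says nothing about which one reasons better.

```python
import math
import random

def make_block(dim, rng):
    """One hypothetical block: a small dense weight matrix plus a bias vector."""
    bound = 0.5 / dim
    w = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(dim)]
    b = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return w, b

def apply_block(block, h):
    """Compute tanh(W h + b) for a single hidden-state vector h."""
    w, b = block
    return [math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + bi)
            for row, bi in zip(w, b)]

def count_params(blocks):
    """Total number of stored weights and biases across a list of blocks."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in blocks)

def stacked_forward(blocks, h):
    """Conventional depth: every layer has its own parameters."""
    for block in blocks:
        h = apply_block(block, h)
    return h

def looped_forward(block, h, n_loops):
    """Reused depth: the same parameters are applied n_loops times."""
    for _ in range(n_loops):
        h = apply_block(block, h)
    return h

rng = random.Random(0)
dim, depth = 8, 12
stacked = [make_block(dim, rng) for _ in range(depth)]
shared = make_block(dim, rng)
x = [1.0] * dim

stacked_out = stacked_forward(stacked, x)
looped_out = looped_forward(shared, x, n_loops=depth)

stacked_params = count_params(stacked)    # 12 blocks * (64 weights + 8 biases) = 864
looped_params = count_params([shared])    # 1 block  * (64 weights + 8 biases) = 72
```

Both forward passes apply a block twelve times, so the compute per input is the same; only the looped version can change `n_loops` per input without changing its parameter count, which is the property the adaptive-depth results above rely on.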

What's actually happening inside the loop is revealing. Looped transformers don't invent novel computation on each pass. Instead, each recurrent layer converges to its own fixed point, and together these form a stable cyclic trajectory that mirrors and repeats the feedforward inference stages the model already uses (How do looped language models actually improve reasoning in depth?). So the win isn't from more representational machinery; it's from giving the model more *passes* to converge on an answer. That reframes "depth" as compute you can reuse rather than parameters you must add.
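The idea of "more passes to converge" can be illustrated with a toy fixed-point loop. This is a deliberately simplified sketch under stated assumptions: a single hypothetical block whose weights are scaled so the update is a contraction, not the per-layer cyclic trajectories reported for real looped transformers. The names `make_contractive_block` and `loop_until_converged`, the tolerance, and the pass limit are all invented for illustration.

```python
import math
import random

def make_contractive_block(dim, rng):
    # Entries are bounded by 0.5/dim, so each row's absolute sum is at most 0.5.
    # Because tanh is 1-Lipschitz, the update below is a contraction in the max norm.
    bound = 0.5 / dim
    w = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(dim)]
    b = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return w, b

def apply_block(block, h):
    """Compute tanh(W h + b) for a single hidden-state vector h."""
    w, b = block
    return [math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + bi)
            for row, bi in zip(w, b)]

def loop_until_converged(block, h, tol=1e-6, max_loops=100):
    """Re-apply the same block until the state stops moving.

    Returns the final state and the number of passes actually used.
    """
    for passes in range(1, max_loops + 1):
        new_h = apply_block(block, h)
        delta = max(abs(a - c) for a, c in zip(new_h, h))
        h = new_h
        if delta < tol:
            return h, passes
    return h, max_loops

rng = random.Random(0)
block = make_contractive_block(8, rng)

# Two different starting states, iterated through the same shared block.
near, near_passes = loop_until_converged(block, [0.0] * 8)
far, far_passes = loop_until_converged(block, [5.0] * 8)
```

Both runs settle on the same state even though they start far apart, and the loop stops on its own once the update falls below the tolerance. That stopping rule is one simple way to read "convergence as a halting signal": the number of passes is decided by the input, not fixed by the architecture.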

This connects to a broader pattern: the corpus repeatedly finds that *how* you spend compute matters more than *how much* raw model you have. Test-time compute can substitute for parameter scaling on hard prompts, showing that pretraining and inference compute are interchangeable resources rather than independent ones (Can inference compute replace scaling up model size?). Even at the architecture level, deep-and-thin sub-billion models beat balanced-width designs by composing abstract concepts through successive layers, which is depth-as-composition again rather than width (Does depth matter more than width for tiny language models?). A related efficiency theme shows up in finetuning, where learned interventions on frozen representations outperform LoRA-style weight updates while being roughly 10-50× more parameter-efficient (Can editing hidden representations beat weight updates for finetuning?). In each case, reusing what's already there outperforms growing the model.

One caution worth carrying away: reuse isn't magic, and the *framework* matters. In recursive reasoning, ablations show that naively bolting stochasticity onto an existing model yields no improvement. The gains come specifically from a principled variational objective that gives the recursion something coherent to converge toward (Does adding randomness alone improve recursive reasoning models?). So reused computation outperforms new depth not because looping is inherently superior, but because iteration, when properly trained to converge, extracts more reasoning from the same parameters than a single deeper pass does in the results collected here.

Sources 8 notes

Can models learn by looping instead of growing larger?

Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.

Can looping layers beat adding depth in diffusion models?

LoopMDM matches same-size masked diffusion models with 3.3× fewer training FLOPs and exceeds deeper non-looped baselines on reasoning tasks. Reusing computation through selective early-middle layer loops proves more effective than adding depth at fixed parameter budgets.

Can looped computation replace parameter count in world models?

LoopWM achieves up to 100× parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.

How do looped language models actually improve reasoning in depth?

Each recurrent layer converges to a distinct fixed point, and together these form stable cyclic trajectories. Looped models learn to mirror and repeat feedforward inference stages rather than discover new computation, and this behavior emerges naturally without explicit training.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
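For readers who want the shape of the intervention, the low-rank update can be written as below. This is a sketch of the LoReFT form as published in the ReFT work, not something stated in this note; the symbols h, R, W, b, r, and d are introduced here for illustration.

```latex
\Phi_{\mathrm{LoReFT}}(h) = h + R^{\top}\left(W h + b - R h\right),
\qquad R \in \mathbb{R}^{r \times d},\quad W \in \mathbb{R}^{r \times d},\quad b \in \mathbb{R}^{r},\quad r \ll d
```

Here h is a hidden representation produced by the frozen model, R is a low-rank projection with orthonormal rows, and W and b define the target values inside that r-dimensional subspace. Only R, W, and b are trained, roughly 2rd + r values per intervention, which is where the parameter savings relative to updating weights come from.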

Does adding randomness alone improve recursive reasoning models?

GRAM's ablations show naive stochasticity added to existing models yields no improvement. Gains come specifically from amortized variational inference, which couples stochastic latents to a principled generative objective rather than injecting undirected noise.
