Should loop count be fixed at training time or selected at test time?
This explores whether the number of times a looped/recurrent model re-runs its computation should be baked in during training or chosen dynamically per-input at inference — and what the corpus says about who decides when to stop.
The corpus leans toward a split answer: there is a structural sweet spot you can learn once, but the most accurate stopping point is one the model should detect on the fly, input by input.
Start with the case for fixing it. Looped models don't reward more iterations linearly: there's a sweet spot, not a slope. Work on LoopCoder-v2 found that two loops carried almost all the productive refinement, while three or more regressed as later loops oscillated and lost representational diversity rather than converging (see: Does adding more loops always improve looped language models?). If the optimal depth is a small, stable number, you could just hard-code it. But that sweet spot is an average over a dataset: it tells you the typical best depth, not the right depth for any single hard or easy input.
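The gap between a dataset-level sweet spot and a per-input optimum can be made concrete with a small sketch. The scores below are synthetic, invented purely for illustration (they are not LoopCoder-v2 results), and the function names are hypothetical.

```python
from statistics import mean

# SYNTHETIC scores for loop counts 1..4, invented for illustration only.
# Each list holds one input's score after 1, 2, 3, and 4 loops.
scores = {
    "easy_input":   [0.92, 0.91, 0.88, 0.85],
    "medium_input": [0.60, 0.80, 0.74, 0.70],
    "hard_input":   [0.20, 0.45, 0.50, 0.48],
}

def best_fixed_depth(scores):
    """Return the single loop count with the highest mean score, plus the means."""
    n_depths = len(next(iter(scores.values())))
    means = [mean(s[d] for s in scores.values()) for d in range(n_depths)]
    return means.index(max(means)) + 1, means

def best_depth_per_input(scores):
    """Return the loop count that maximizes the score for each input separately."""
    return {name: s.index(max(s)) + 1 for name, s in scores.items()}

fixed_depth, depth_means = best_fixed_depth(scores)
per_input = best_depth_per_input(scores)

print("best fixed depth:", fixed_depth)
print("best depth per input:", per_input)
```

In this toy table the averaged optimum lands on two loops, yet the easy input was already best after one pass and the hard input benefits from a third. That is the sense in which a fixed count is right on average and wrong for individual inputs.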
That's where test-time selection earns its keep. The idea behind iterative latent depth is precisely that harder steps deserve more computation: looped world models reach up to 100x parameter efficiency by spending more refinement passes on harder prediction steps and fewer on easy ones (see: Can looped computation replace parameter count in world models?). A fixed loop count throws that adaptivity away: you either overspend on easy inputs or underspend on hard ones. The interesting question then becomes: how does the model know when to stop? Detecting when the latent state reaches a fixed point turns out to halt more accurately than a learned halt token, calibrating compute close to the accuracy-saturation point without any special training regime (see: Can fixed points replace learned halt tokens in reasoning models?). So the stopping signal can be read off the computation itself at test time, rather than predicted in advance.
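A minimal sketch of fixed-point halting, under stated assumptions: the looped block is stood in for by a toy contractive update, and the names `loop_until_fixed_point`, `make_step`, `tol`, and `max_loops` are hypothetical. This illustrates the general idea of stopping when consecutive latent states stop changing; it is not the FPRM implementation.

```python
import math

def loop_until_fixed_point(step, h0, tol=1e-4, max_loops=64):
    """Re-apply `step` until the state stops moving, or max_loops is reached.

    Returns (final_state, loops_used, converged).
    """
    h = list(h0)
    for n in range(1, max_loops + 1):
        h_next = step(h)
        # Size of the change between consecutive latent states...
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_next, h)))
        # ...relative to the size of the new state (epsilon avoids division by zero).
        scale = math.sqrt(sum(a * a for a in h_next)) + 1e-12
        h = h_next
        if delta / scale < tol:
            return h, n, True   # state has (approximately) reached a fixed point
    return h, max_loops, False  # hit the cap without settling

def make_step(target, rate):
    """Toy stand-in for a looped block (an assumption, not a real model):
    each pass moves the state a fraction `rate` of the way toward `target`."""
    return lambda h: [hi + rate * (ti - hi) for hi, ti in zip(h, target)]

easy = make_step([1.0, 2.0], rate=0.9)  # settles in few passes
hard = make_step([1.0, 2.0], rate=0.3)  # needs many more passes

_, easy_loops, easy_ok = loop_until_fixed_point(easy, [0.0, 0.0])
_, hard_loops, hard_ok = loop_until_fixed_point(hard, [0.0, 0.0])
_, capped_loops, capped_ok = loop_until_fixed_point(hard, [0.0, 0.0], max_loops=3)

print("easy:", easy_loops, "hard:", hard_loops, "capped:", capped_loops, capped_ok)
```

The same halting rule spends few passes on the fast-settling input and many on the slow one, with no halt token involved. The `max_loops` cap matters in practice: a block that oscillates instead of converging would otherwise never trigger the criterion.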
There's a deeper pattern here that reframes the whole question. Several notes warn that training-time convergence quietly destroys the diversity that test-time procedures depend on. RL post-training collapses onto a single dominant format within an epoch (see: Does RL training collapse format diversity in pretrained models?). And when a model feeds into a search procedure, training for diversity beats optimizing for one answer, because entropy-collapsed policies can't reach solutions that exploration can (see: Should training maximize diversity when models feed into search?). Loop count is the same shape of decision: committing hard at training time forecloses options that a test-time process could have explored.
The synthesis: don't think of it as fixed-versus-selected, but as a division of labor. Training should establish the looped block and its stability (and resist over-converging away the variety that makes extra passes useful), while the loop count itself is best chosen per-input at test time via a convergence signal. The caveat is that more is not better past the sweet spot, so test-time selection needs a real halting criterion, not just "keep going." The less obvious takeaway: the better halting signal isn't a token the model learns to emit, but the geometry of its own latent state settling into a fixed point.
Sources (5 notes)
LoopCoder-v2 shows that two loops deliver broad gains over baseline, but three or more loops regress. Loop 2 carries the productive refinement; later loops oscillate with reduced representational diversity rather than converging toward better performance.
LoopWM achieves up to 100x parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.
FPRM shows that looped transformers halt more accurately by detecting when their latent state reaches a fixed point. This calibrates compute closer to the accuracy-saturation point than learned halt tokens do, and it requires no special training regime.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.