Why does the second loop do most of the productive refinement work?
This explores why, in looped/iterative language models, the *second* pass through the computation does the heavy lifting of refinement — while the first sets up and later passes add little or even hurt.
This reads the question as being about looped and iteratively refined models (architectures that reapply the same computation multiple times) and why the gain concentrates in the second pass rather than spreading evenly across loops. The most direct answer in the corpus comes from LoopCoder-v2, which finds looping has a *sweet spot, not a slope*: two loops deliver broad gains, but three or more regress. Loop 2 carries the productive refinement, while later loops oscillate with reduced representational diversity instead of converging toward something better (see "Does adding more loops always improve looped language models?"). So the phenomenon in the question is real, but the interesting part is *why* refinement stops being productive after the second loop.
A mechanistic clue sits one note over: looped transformers don't invent new computation on each pass. Each recurrent layer settles into distinct fixed points forming stable cyclic trajectories, so the model is essentially re-enacting the feedforward stages of inference in depth (see "How do looped language models actually improve reasoning in depth?"). Read together, the two notes suggest why there is a sweet spot: the first pass produces a rough result, the second pass re-applies inference to *correct and sharpen* it, and by the third the trajectory has already settled onto its fixed point. Further loops just orbit it, adding noise rather than signal. Refinement happens where there's still error to remove; once the dynamics stabilize, extra depth has nothing left to do.
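The convergence argument can be illustrated with a toy fixed-point iteration. This is a minimal sketch under stated assumptions, not the LoopCoder-v2 setup or any real looped transformer: the "loop" is a hypothetical scalar update that contracts toward a fixed point `target` by a factor `rate`, plus a small alternating perturbation `jitter` standing in for the oscillation described above. All names and values are made up for illustration. The toy does not reproduce the specific two-loop optimum; it only shows that a shared contracting update removes most of the error in the earliest passes, after which the state orbits a floor set by the perturbation.

```python
def run_loops(x0, target, rate, jitter, n_loops):
    """Apply the same update n_loops times; return the error after each pass."""
    errors = []
    x = x0
    for k in range(n_loops):
        # Shared update: contract toward the fixed point by a factor `rate`.
        x = target + rate * (x - target)
        # Assumed perturbation, alternating in sign to mimic oscillation.
        x += -jitter if k % 2 == 0 else jitter
        errors.append(abs(x - target))
    return errors


# Hypothetical values: start far from the fixed point, contract strongly.
errors = run_loops(x0=0.0, target=1.0, rate=0.2, jitter=0.01, n_loops=6)
for k, err in enumerate(errors, start=1):
    print(f"pass {k}: error = {err:.5f}")
```

With these values the error falls steeply over the first passes and then hovers near a floor of roughly 0.008, with the fifth pass slightly worse than the fourth. Extra passes past that point change the state without improving it.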
This is the same failure architecture that shows up in response-level iterative refinement, not just inside the network. Sequential revision methods reproduce the *overthinking* failure mode at a slower timescale: they accumulate noise without any guarantee of improvement, which is why Progressive Draft Refinement deliberately compresses memory between iterations and beats longer reasoning traces at matched compute (see "Do iterative refinement methods suffer from overthinking?"). The lesson rhymes: a little iteration corrects; a lot of it drifts.
There's a complementary explanation worth weighing: maybe later loops fail because the model runs out of *diversity* to draw on. The corpus supports the idea that this matters. When single-model refinement saturates, the productive move is to spend compute on a diverse population of models rather than refining one harder (see "Should extra compute refine one model or build many?"). And mining *intermediate* reasoning points (rather than final ones) yields more accurate answers because it samples alternative paths before early commitment narrows the solution space (see "Can intermediate reasoning points yield better answers than final ones?"). Both reinforce the same picture from the other side: refinement is productive only while there's still representational variety to exploit, and collapses into oscillation once it's gone.
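The intermediate-point idea can be sketched as a majority vote over completions started from each prefix of a reasoning trace. This is a minimal sketch, not the cited method's implementation: the blank-line segmentation rule, the `mode_answer` helper, and the `stub_complete` function (which stands in for a model call) are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def split_subthoughts(trace: str) -> List[str]:
    # Assumption: subthoughts are separated by blank lines.
    return [part.strip() for part in trace.split("\n\n") if part.strip()]


def mode_answer(trace: str, complete_from: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Complete from every intermediate point and return the most common answer."""
    parts = split_subthoughts(trace)
    answers = []
    for i in range(1, len(parts) + 1):
        prefix = "\n\n".join(parts[:i])
        # `complete_from` is a placeholder for prompting a model from this prefix.
        answers.append(complete_from(prefix))
    mode = Counter(answers).most_common(1)[0][0]
    return mode, answers


# Hypothetical trace whose final step commits to a different answer.
trace = (
    "Set up the equation.\n\n"
    "Try x = 6, so 6 * 7 = 42.\n\n"
    "Check the units.\n\n"
    "Second-guess and commit to 41."
)


def stub_complete(prefix: str) -> str:
    # Stand-in for a model: answers 41 only once the late commitment appears.
    return "41" if "commit" in prefix else "42"


mode, answers = mode_answer(trace, stub_complete)
print(mode, answers)
```

In this contrived case the mode over intermediate points differs from the answer at the final point, which is the situation the note describes: earlier prefixes still carry paths that a late commitment discards.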
The thing you might not have expected to learn: "more iteration = better" rarely holds in these systems. Looped transformers regress past two loops, sequential draft refinement accumulates noise, and extended chain-of-thought tends to produce *more text, not more computation* (see "Do reasoning models actually beat standard models on optimization?"). The second loop wins not because two is magic, but because it's the last pass that still has error to fix before the dynamics lock in: refinement is a one-correction window, not a slope you can climb indefinitely.
Sources (6 notes)
LoopCoder-v2 shows that two loops deliver broad gains over baseline, but three or more loops regress. Loop 2 carries the productive refinement; later loops oscillate with reduced representational diversity rather than converging toward better performance.
Each recurrent layer converges to distinct fixed points forming stable cyclic trajectories. Looped models learn to mirror and repeat feedforward inference stages rather than discover new computation, emerging naturally without explicit training.
Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.
Once single-model pretraining saturates, aggregating predictions from a diverse population of models reaches lower validation loss than further refining one model. Anti-correlated learning-rate and weight-decay schedules plus chain distillation enable this efficiently, matching 256-epoch ensembles with ~56 epochs.
Segmenting reasoning traces into subthoughts and prompting completions from each intermediate point yields mode answers up to 13% more accurate than final answers. This works because it mines alternative paths before early commitment narrows the solution space.
Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.