LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

Paper · arXiv 2606.18023 · Published June 16, 2026
Looped Models

Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain–cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity.

Introduction. Looped large language models (LLMs) [16, 17] have emerged as a promising way to scale the effective computational depth of language models without proportionally increasing their parameter count. Instead of stacking many distinct layers, a looped LLM repeatedly applies a shared Transformer block, allowing the same parameters to perform multiple rounds of latent-space computation [5, 9, 18]. This design is especially attractive for test-time compute scaling, because it enables additional internal refinement without generating auxiliary reasoning tokens [8]. Recent work has shown that such recurrent-depth LLMs can approach the quality of deeper non-looped Transformers and improve reasoning performance as more inference-time computation is used [8, 14, 19]. Despite this promise, standard sequential looping is difficult to deploy efficiently: each additional loop requires another pass through the shared block and introduces loop-specific KV-cache states, causing both latency and memory to grow with the loop count [15].
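As a rough illustration of the two points above (weight sharing across loops, and cache growth under sequential looping), the following minimal sketch may help. It is not the paper's implementation: `toy_block` is a hypothetical stand-in for a shared Transformer block, and the cache formula simply assumes that every loop stores its own keys and values for every layer, token, KV head, and head dimension.

```python
from typing import Callable, List

Vector = List[float]


def looped_forward(block: Callable[[Vector], Vector],
                   hidden: Vector,
                   num_loops: int) -> Vector:
    """Apply the same block num_loops times, so parameters are shared across loops."""
    for _ in range(num_loops):
        hidden = block(hidden)
    return hidden


def sequential_kv_cache_elements(num_loops: int,
                                 layers_per_block: int,
                                 seq_len: int,
                                 kv_heads: int,
                                 head_dim: int) -> int:
    """Count cached K and V elements, assuming each loop keeps its own cache."""
    # Factor 2 covers keys and values; all sizes here are illustrative assumptions.
    per_loop = 2 * layers_per_block * seq_len * kv_heads * head_dim
    return num_loops * per_loop


def toy_block(hidden: Vector) -> Vector:
    """Hypothetical stand-in for a shared block: moves each value halfway toward 1.0."""
    return [0.5 * x + 0.5 for x in hidden]


if __name__ == "__main__":
    print(looped_forward(toy_block, [0.0], num_loops=2))
    one = sequential_kv_cache_elements(1, 2, 4, 1, 8)
    three = sequential_kv_cache_elements(3, 2, 4, 1, 8)
    print(one, three)
```

Under this assumption the cache size is exactly proportional to the loop count, which is the cost that shared-KV designs such as PLT aim to avoid.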

Discussion / Conclusion. The per-loop analysis in Section 4 reveals a consistent pattern: loop contributions are not uniform, and the balance between representational gains and the cost imposed by the CLP offset shifts with loop index in a way that explains the non-monotonic performance curve. The second loop is the primary site of productive refinement: among the refinement loops it introduces the most coherent hidden-state update, the highest inter-loop attention divergence, and the greatest output-distribution shift, and it is where effective rank peaks. The first loop establishes the global KV cache used by all subsequent loops, so the quality of the loop-1 representations sets a ceiling on the information available to every later loop via the frozen global attention branch.
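The excerpt above does not define effective rank, so the following sketch assumes one common definition: the exponential of the Shannon entropy of the normalized singular-value spectrum. The function name and its input convention are illustrative, not taken from the paper. The singular values would typically come from an SVD of a matrix of per-token hidden states collected at a given loop.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """Effective rank as exp(entropy) of the normalized singular values.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    if any(s < 0 for s in singular_values):
        raise ValueError("singular values must be non-negative")
    total = sum(singular_values)
    if total == 0:
        raise ValueError("at least one singular value must be positive")
    # Normalize the spectrum into a probability distribution.
    probs = [s / total for s in singular_values]
    # Zero entries contribute nothing to the entropy (limit of p * log p is 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))
    print(effective_rank([1.0, 0.0, 0.0]))
```

Under this definition a flat spectrum of four equal singular values gives an effective rank of 4, while a spectrum with a single nonzero value gives 1, so a lower value at later loops would be consistent with reduced representational diversity.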