What makes looped latent computation more efficient than scaling attention capacity?
This explores why running computation in a model's own latent space — feeding hidden states back through the same weights — buys more capability per unit of compute than simply widening the attention window or growing the attention matrix.
The short version the corpus keeps circling: attention pays a quadratic cost to relate every token to every other token, and much of that capacity appears to go unused, whereas looped latent computation reuses the same weights to *deepen* processing and *compress* what matters into internal state.
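The asymmetry can be made concrete with a little counting. The sketch below is illustrative only: the function names are hypothetical, and it ignores constants such as head count, hidden width, and the number of layers. It shows that the attention score matrix grows with the square of the token count, while a looped block keeps its parameter count fixed however many times it is applied.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Query-key score entries for full self-attention (one head, one layer)."""
    return n_tokens * n_tokens


def stacked_parameters(params_per_block: int, n_blocks: int) -> int:
    """Parameter count when each of n_blocks layers has its own weights."""
    return params_per_block * n_blocks


def looped_parameters(params_per_block: int, n_loops: int) -> int:
    """Parameter count when one block is applied n_loops times with shared weights."""
    # n_loops changes how much compute is spent, not how many weights exist.
    return params_per_block


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, attention_score_entries(n))
    print(stacked_parameters(1_000_000, 8), looped_parameters(1_000_000, 8))
```

Doubling the token count quadruples the number of score entries, whereas adding loop iterations only adds compute.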
The sharpest reframing comes from work arguing the long-context bottleneck was never really about memory at all: it's the *compute* needed to transform raw context into a usable internal state (see "Is long-context bottleneck really about memory or compute?"). Scaling attention capacity attacks the wrong constraint: it grows the place you store tokens, not the work of digesting them. A feedback loop attacks the right one. TransformerFAM shows a model can attend to its *own* latents to build working memory for indefinitely long inputs. Crucially, it adds no new weights, so the gain is pure compute reuse rather than parameter inflation (see "Can models learn working memory by attending to their own latents?").
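As a rough illustration of the "fixed weights, unbounded input" idea, here is a minimal sketch of a fixed-size latent memory that is updated chunk by chunk with the same weights. This is not TransformerFAM's mechanism (which feeds memory back through attention); it is a generic recurrent stand-in, and every name and weight value in it is a hypothetical chosen for the example.

```python
import math
from typing import List, Sequence


def update_memory(
    memory: Sequence[float],
    chunk: Sequence[float],
    w_mem: Sequence[float],
    w_in: Sequence[float],
) -> List[float]:
    """Apply one shared update: mix each memory slot with a summary of the chunk."""
    chunk_summary = sum(chunk) / len(chunk)
    return [
        math.tanh(wm * m + wi * chunk_summary)
        for m, wm, wi in zip(memory, w_mem, w_in)
    ]


def process_stream(
    tokens: Sequence[float],
    chunk_size: int,
    w_mem: Sequence[float],
    w_in: Sequence[float],
) -> List[float]:
    """Consume an arbitrarily long input while keeping a fixed-size memory."""
    memory = [0.0] * len(w_mem)  # memory size is set by the weights, not the input
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # The same w_mem and w_in are reused for every chunk: no new weights.
        memory = update_memory(memory, chunk, w_mem, w_in)
    return memory


if __name__ == "__main__":
    W_MEM = [0.9, 0.5, 0.1]  # assumed toy values
    W_IN = [0.2, 0.5, 1.0]   # assumed toy values
    print(process_stream([0.1] * 10, 5, W_MEM, W_IN))
    print(process_stream([0.1] * 1000, 5, W_MEM, W_IN))
```

Under these assumptions, a 10-token input and a 1000-token input both end with a three-slot memory and the same two weight vectors; only the number of update steps differs.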
There's also a deeper efficiency argument at the learning level. Predicting your own latents is *exponentially* more sample-efficient than predicting tokens, because same-level latent representations are far more correlated than raw tokens. Compositional structure gets recovered with a number of samples that stays flat as hierarchy depth grows, instead of exploding (see "Why is predicting latents more sample-efficient than tokens?"). Latent computation operates on an already-distilled signal, while attention over tokens wades through redundancy. Latent-Thought models lean into this by opening a scaling dimension that's independent of parameter count entirely: you scale the latent budget rather than the weights (see "Can latent thought vectors scale language models beyond parameters?").
The interesting twist is that 'looping' doesn't only mean going deeper serially. GRAM shows you can scale latent reasoning in *width*, by sampling parallel latent trajectories, to get the benefits of more computation without paying the serial latency of stacking ever more depth (see "Can reasoning systems scale wider instead of only deeper?"). That complements the classic finding that depth itself composes abstraction more efficiently than width for small models, by layering concepts rather than spreading parameters thin (see "Does depth matter more than width for tiny language models?"). And there's evidence the latent space is naturally suited to this kind of work: hidden states *sparsify* adaptively under hard, out-of-distribution tasks, behaving like a selective filter rather than a failure mode, which suggests the model is already routing compute to what matters (see "Do language models sparsify their activations under difficult tasks?").
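The width-versus-depth trade can be sketched with a toy stochastic rollout. This is not GRAM's algorithm: the transition rule, the noise model, and all function names are assumptions made for illustration. The point is only that independent trajectories add total compute without lengthening the serial chain of steps.

```python
import random
from typing import List


def rollout(state: float, n_steps: int, noise: float, rng: random.Random) -> float:
    """Run one latent trajectory: n_steps stochastic transitions, strictly in series."""
    for _ in range(n_steps):
        state = 0.5 * state + rng.gauss(0.0, noise)  # assumed toy transition
    return state


def sample_width(
    initial: float, n_trajectories: int, n_steps: int, noise: float, seed: int
) -> List[float]:
    """Sample several independent trajectories from the same starting state."""
    rng = random.Random(seed)
    # Computed in a loop here, but no trajectory depends on another,
    # so a real system could run them in parallel.
    return [rollout(initial, n_steps, noise, rng) for _ in range(n_trajectories)]


def serial_latency(n_steps: int, n_trajectories: int) -> int:
    """Sequential steps on the critical path when trajectories run in parallel."""
    return n_steps


def total_compute(n_steps: int, n_trajectories: int) -> int:
    """Total transition evaluations across all trajectories."""
    return n_steps * n_trajectories


if __name__ == "__main__":
    finals = sample_width(initial=1.0, n_trajectories=8, n_steps=4, noise=0.1, seed=0)
    print(len(finals), serial_latency(4, 8), total_compute(4, 8))
```

Eight trajectories of four steps each spend as many transition evaluations as one 32-step chain, but the critical path stays at four steps.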
The contrast case is Titans, which doesn't dispute any of this so much as split the difference: keep attention for short-term, quadratic work, but offload long-term memory to a separate compressed module that prioritizes *surprising* tokens for storage (see "Can neural memory modules scale language models beyond attention limits?"). That's the whole thesis in miniature: the win isn't a bigger attention matrix, it's moving the heavy lifting into a compressed, reused internal representation. Worth knowing as a footnote: even within plain attention, a handful of input-agnostic 'massive activations' quietly do the work of biasing attention, hinting that raw attention capacity is doing far less than its size suggests (see "Do hidden massive activations act as attention bias terms?").
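A minimal sketch of surprise-gated storage follows. It is a simplified stand-in, not the Titans memory module: here "surprise" is assumed to be the absolute gap between a value and the running mean of what came before, and the function name and threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def store_surprising(
    values: Sequence[float], threshold: float
) -> List[Tuple[int, float]]:
    """Keep only (index, value) pairs that deviate sharply from the running mean."""
    memory: List[Tuple[int, float]] = []
    total = 0.0
    count = 0
    for index, value in enumerate(values):
        # Predict the next value as the mean so far (0.0 before any input is seen).
        prediction = total / count if count else 0.0
        surprise = abs(value - prediction)
        if surprise > threshold:
            memory.append((index, value))  # only surprising items are stored
        total += value
        count += 1
    return memory


if __name__ == "__main__":
    stream = [1.0, 1.0, 1.0, 9.0, 1.0]
    print(store_surprising(stream, threshold=3.0))
```

With this stream and threshold, only the outlier at index 3 is stored, so the memory holds one entry instead of five.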
Sources (9 notes)
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.
A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.