Can models learn by looping instead of growing larger? · Gravity7

Can fixed points replace learned halt tokens in reasoning models?

Does stopping inference when a looped transformer's internal state stabilizes provide a better halting signal than training a dedicated token predictor? This matters for building adaptive compute without the cost of training a separate halting mechanism.
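
The halting idea in this question can be sketched as a loop that stops once successive states differ by less than a tolerance. This is a minimal illustration, not any paper's method: `toy_block` is a hypothetical stand-in for a shared transformer block, and `tol` and `max_loops` are assumed parameters.

```python
import math

def loop_until_stable(step, state, tol=1e-6, max_loops=100):
    """Apply `step` repeatedly and halt once the state stops changing.

    Returns (final_state, loops_used, converged).
    """
    for n in range(1, max_loops + 1):
        new_state = step(state)
        # Euclidean distance between consecutive states is the halting signal.
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_state, state)))
        state = new_state
        if delta < tol:
            return state, n, True
    return state, max_loops, False

def toy_block(state):
    # Hypothetical stand-in for a looped block: a contraction that halves
    # the distance to the fixed point [1.0, 2.0] on every pass.
    target = [1.0, 2.0]
    return [0.5 * s + 0.5 * t for s, t in zip(state, target)]

final_state, loops_used, converged = loop_until_stable(toy_block, [0.0, 0.0])
print(loops_used, converged)
```

Because the toy block is a contraction, the loop halts well before `max_loops`. A real looped transformer has no such guarantee, which is why the loop cap is kept as a fallback.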

Does adding more loops always improve looped language models?

Conventional wisdom treats loop count as a dial: more loops should mean better reasoning. But does the empirical evidence support monotonic gains, or is there a point where additional loops become counterproductive?

Can looped computation replace parameter count in world models?

Does iteratively refining latent states through a shared transformer block achieve comparable performance to larger models while adapting computation depth per prediction step? This matters because world models struggle with long-horizon rollout error and computational cost.

Why do transformers need explicit chain-of-thought reasoning?

Explores whether chain-of-thought is a fundamental reasoning mechanism or a workaround for architectural limitations in how transformers track evolving state across computation steps.

Can continuous thoughts have tractable likelihoods for sampling and scoring?

Most latent-reasoning methods discard the likelihood and sampling properties that made textual chain-of-thought trainable. Can normalizing flows recover those affordances in continuous thought space while preserving efficiency?

Why does latent chain-of-thought fail so easily in training?

Explores why latent reasoning is fragile compared to textual chain-of-thought, focusing on how outcome-only supervision creates gradient starvation and representational drift in learned reasoning trajectories.

Can tiny recursive networks outperform massive language models?

Can a small network that recursively refines its reasoning on a latent state match or beat billion-parameter LLMs on hard reasoning puzzles? This challenges assumptions about scale and hierarchy in AI reasoning.

How do looped language models actually improve reasoning in depth?

Mechanistic analysis investigates whether looping transformer layers creates genuinely new computation or reuses existing inferential stages. Understanding this distinction clarifies why recurrent depth can match standard scaling.

Can reasoning be learned during pretraining rather than after?

Does building iterative computation into the pretraining phase itself allow language models to develop reasoning before post-hoc fine-tuning? And if so, does latent reasoning align better with outputs than explicit chain-of-thought?

Can looped transformers generalize to unseen knowledge combinations?

Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.

Can explicit stack tracking improve how transformers learn recursive syntax?

Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.

Can stochastic latent reasoning let models explore multiple solutions?

When recursive reasoning models collapse to a single deterministic path, can introducing stochasticity into latent transitions let them maintain uncertainty and consider alternative strategies instead? This matters because real problems often have multiple valid answers.

Can models treat long prompts as external code environments?

Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
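
The offloading pattern in this last question can be sketched as follows: the long input lives in a Python variable, and short model-written expressions are evaluated against it. Everything here is an assumption made for illustration, including the helper `run_query`, the synthetic `long_prompt`, and the example snippets. It is not the interface of any particular system.

```python
# Hypothetical long input held outside the model's attention window.
long_prompt = "\n".join(f"record {i}: value={i * i}" for i in range(10_000))

def run_query(snippet, context):
    """Evaluate a model-written expression against the stored context.

    Emptying the builtins keeps the example small; it is NOT a real sandbox.
    """
    env = {"__builtins__": {}, "context": context, "len": len}
    return eval(snippet, env)

# Snippets a model might emit instead of reading every line itself.
n_lines = run_query("len(context.splitlines())", long_prompt)
hit = run_query("context.splitlines()[1234]", long_prompt)
print(n_lines, hit)
```

Only the small query results would be fed back to the model, so the prompt can exceed the context window while each step stays short.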