INQUIRING LINE

Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · cross-cluster

Can non-variational posterior approximation schemes deliver comparable reasoning improvements?

This explores whether 'thinking by iterative refinement at inference time' (energy minimization, recursion, diffusion-style denoising) can match the reasoning gains of standard approaches, even when the scheme is not one of the variational or probabilistic methods usually framed this way.

This reads the question as asking whether inference schemes that *refine an answer through repeated passes* — instead of generating it in one forward sweep — can deliver real reasoning improvements. The corpus is unusually rich here, and the short answer is yes: several non-standard inference mechanisms not only match but exceed conventional scaling.

The clearest case is energy-based transformers, which assign an 'energy' score to each input-prediction pair and reach an answer by running gradient descent to minimize that energy at inference time (see: Can energy minimization unlock reasoning without domain-specific training?). This is a fundamentally different way to 'think' than predicting the next token: the reported training scaling rate was 35% higher and the inference-compute gains 29% larger than a strong Transformer++ baseline, with better out-of-distribution generalization, all from unsupervised learning and with no domain-specific scaffolding. Diffusion LLMs make a related move. Their bidirectional attention lets reasoning be embedded directly into masked positions and refined *alongside* the answer, so confidence in the answer can converge early while the reasoning is still being polished, and an early exit at that point cuts compute by about 50% while maintaining accuracy (see: Can reasoning and answers be generated separately in language models?).
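Inference by energy minimization can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `energy` is a hand-written quadratic standing in for a learned energy function, and `grad_y` and `infer` are hypothetical helper names, not part of any Energy-Based Transformer implementation.

```python
def energy(x, y):
    # Hypothetical stand-in for a learned energy function:
    # low when the candidate prediction y is compatible with the input x.
    return (y - (2.0 * x + 1.0)) ** 2


def grad_y(x, y, eps=1e-5):
    # Central finite-difference estimate of dE/dy (a real model would use autodiff).
    return (energy(x, y + eps) - energy(x, y - eps)) / (2.0 * eps)


def infer(x, y0=0.0, lr=0.1, steps=50):
    # Start from an arbitrary guess and descend the energy with respect to y.
    y = y0
    trace = [energy(x, y)]
    for _ in range(steps):
        y -= lr * grad_y(x, y)
        trace.append(energy(x, y))
    return y, trace


y_hat, trace = infer(3.0)            # many refinement steps
y_few, _ = infer(3.0, steps=5)       # fewer steps, rougher answer
print(y_hat, y_few, trace[0], trace[-1])
```

In this toy setting the number of descent steps acts as an inference-compute budget: more steps move the prediction closer to the energy minimum, which is the sense in which such a model can 'think longer' on one input.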

What ties these to the rest of the collection is a shared principle: reasoning power can come from *recursion and iterated computation* rather than parameter count. A 7-million-parameter, two-layer network that simply recurses on its own latent state reached 45% on ARC-AGI-1 while using 0.01% of the parameters of the LLMs it beat, and the authors trace the gain to recursion itself, not scale or hierarchy (see: Can tiny recursive networks outperform massive language models?). Looped architectures tell the same story: re-applying the same layers in recurrent depth outperforms larger feedforward networks on reasoning tasks because the loop enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals acting as a natural stopping rule (see: Can models learn by looping instead of growing larger?).
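The idea of re-applying one block until the latent state stops changing can be sketched as a fixed-point loop. This is a minimal illustration under stated assumptions: `block` is a hypothetical scalar contraction standing in for a shared network layer, and the tolerance-based halting rule is one simple reading of "convergence signals", not the mechanism of any specific paper.

```python
import math


def block(z, x):
    # Hypothetical shared "layer": the same function is reused at every depth.
    # The 0.5 factor makes it a contraction, so the loop is guaranteed to settle.
    return math.tanh(0.5 * z + x)


def recurse(x, z0=0.0, tol=1e-6, max_loops=100):
    # Re-apply the same block to the latent state z until it stops changing.
    z = z0
    for n in range(1, max_loops + 1):
        z_next = block(z, x)
        if abs(z_next - z) < tol:    # convergence signal used as the halting rule
            return z_next, n
        z = z_next
    return z, max_loops


z_star, loops_used = recurse(0.3)
print(z_star, loops_used)
```

The parameter count is fixed by `block`; only the number of loops grows with the difficulty of reaching a stable state, which is the trade the paragraph above describes.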

The interesting twist is *why* iterative-refinement schemes win, and here the corpus offers a caution. Reasoning models beat non-reasoning ones regardless of inference budget because training installs a protocol that makes the extra computation productive; raw compute alone doesn't close the gap (see: Can non-reasoning models catch up with more compute?). And there's a ceiling: across constrained-optimization tasks, models plateau at roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime, and reasoning models do not systematically outperform standard ones, so no clever inference scheme is a magic key to every reasoning task (see: Do larger language models solve constrained optimization better?). Part of the reason is that LLM 'reasoning' often rides on semantic association rather than symbolic manipulation: strip the familiar semantics and performance collapses even with the correct rules in context (see: Do large language models reason symbolically or semantically?).

So the thing worth carrying away: the lift from these alternative inference schemes isn't really about which approximation math you use; it's about giving the model a way to *spend computation iteratively on the same problem*. The same lesson shows up from the opposite direction: accuracy peaks at an intermediate chain-of-thought length, more capable models do best with *shorter* reasoning chains, and RL training drifts toward brevity as skill grows (see: Why does chain of thought accuracy eventually decline with length?). Refinement helps until the model already knows the answer; after that the gain is in stopping early, which energy-based and diffusion methods get almost for free.
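The "stop once the answer is already known" idea can be sketched as a confidence-gated exit. All values below are made up for illustration: `schedule` is an invented sequence of per-step answer confidences, and `steps_until_confident` and the 0.9 threshold are hypothetical, not taken from any of the cited notes.

```python
def steps_until_confident(confidences, threshold=0.9):
    # Return how many refinement steps run before answer confidence
    # first reaches the threshold (or all of them if it never does).
    for step, confidence in enumerate(confidences, start=1):
        if confidence >= threshold:
            return step
    return len(confidences)


# Invented confidence curve: rises quickly, then flattens out.
schedule = [0.35, 0.60, 0.82, 0.93, 0.95, 0.96, 0.97, 0.97, 0.98, 0.98]

used = steps_until_confident(schedule)
saved = 1 - used / len(schedule)
print(used, saved)
```

Once the confidence curve flattens, the remaining steps add cost without changing the answer, so the saving comes entirely from where the threshold sits on that curve.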

Sources 8 notes

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Can tiny recursive networks outperform massive language models?

A 7M-parameter two-layer network recursing on its latent reasoning state reached 45% on ARC-AGI-1, beating larger LLMs with 0.01% of their parameters. The gains come from recursion itself, not scale or hierarchical architecture.

Can models learn by looping instead of growing larger?

Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
