SYNTHESIS NOTE

Can energy minimization unlock reasoning without domain-specific training?

Can a gradient descent-based architecture achieve System 2 Thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?

Synthesis note · 2026-02-23 · sourced from Novel Architectures

Energy-Based Transformers (EBTs) represent a fundamentally different approach to inference-time scaling. Rather than emitting each prediction directly, an EBT is trained to assign a scalar energy to every pair of input and candidate prediction, where lower energy means higher compatibility (the energy behaves like an unnormalized negative log-probability, not a probability itself). Prediction is then reframed as gradient descent on that energy until convergence: the model starts from an initial candidate and iteratively refines it by descending the energy landscape.
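
The refinement loop described above can be sketched in a few lines. This is a toy illustration, not the EBT implementation: the quadratic `energy` function, its hand-written gradient, and the names `predict`, `step_size`, and `tol` are all assumptions made for the example. In an actual EBT the energy would be the output of a trained transformer and the gradient with respect to the prediction would come from automatic differentiation.

```python
def energy(x, y):
    """Hypothetical stand-in for a learned energy: low when y is compatible with x."""
    return (y - 2.0 * x) ** 2


def energy_grad_y(x, y):
    """Analytic gradient of the toy energy with respect to the candidate prediction y."""
    return 2.0 * (y - 2.0 * x)


def predict(x, y_init=0.0, step_size=0.1, max_steps=1000, tol=1e-8):
    """Refine y by gradient descent on the energy until the update becomes tiny."""
    y = y_init
    steps_taken = 0
    for _ in range(max_steps):
        update = step_size * energy_grad_y(x, y)
        y -= update
        steps_taken += 1
        if abs(update) < tol:  # treat a negligible update as convergence
            break
    return y, steps_taken


# The toy energy is minimized at y = 2 * x, so the loop should approach 6.0 here.
y_hat, steps_used = predict(x=3.0)
```

The point of the sketch is only the shape of the computation: the prediction is an optimization variable, and inference is a loop whose length can vary per input.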

This formulation enables System 2 Thinking to emerge from unsupervised learning without any of the domain-specific scaffolding that current approaches require:

No modality restrictions (works on both text and images)
No problem-specific design (not limited to verifiable domains like math/code)
No additional supervision beyond unsupervised pretraining (no verifiers, no verifiable rewards)

The scaling results are striking:

Training: Up to 35% higher scaling rate than Transformer++ with respect to data, batch size, parameters, FLOPs, and depth
Inference: 29% more improvement from additional test-time compute on language tasks than Transformer++
Generalization: Larger performance improvements on data farther out-of-distribution — suggesting EBTs generalize better than existing approaches
Efficiency: Outperform Diffusion Transformers on image denoising with fewer forward passes

The deeper implication: current test-time scaling approaches are constrained by their dependence on either (a) verbalized reasoning chains, which require domain-specific training data, or (b) verifiable reward signals, which RL-based approaches need. EBTs bypass both constraints by making "thinking harder" an inherent property of the architecture: more gradient descent iterations at inference time mean more thinking, and the model's own energy function serves as the implicit verifier.
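
One way to read "more iterations means more thinking, with the energy as verifier" is sketched below. It is a toy under stated assumptions, not the authors' procedure: the non-convex `energy`, the helper names `refine` and `think`, the step size, and the candidate range are all hypothetical. The sketch exposes two compute knobs, the number of descent steps and the number of candidates, and uses the energy itself to pick the final answer.

```python
import random


def energy(y):
    """Hypothetical non-convex energy with two basins; the one near y = -1 is lower."""
    return (y * y - 1.0) ** 2 + 0.3 * y


def energy_grad(y):
    """Analytic derivative of the toy energy."""
    return 4.0 * y * (y * y - 1.0) + 0.3


def refine(y, steps, step_size=0.05):
    """Spend `steps` gradient descent iterations refining one candidate."""
    for _ in range(steps):
        y -= step_size * energy_grad(y)
    return y


def think(num_candidates, steps, seed=0):
    """Refine several random candidates and keep the one the energy scores lowest."""
    rng = random.Random(seed)
    candidates = [rng.uniform(-2.0, 2.0) for _ in range(num_candidates)]
    refined = [refine(y, steps) for y in candidates]
    return min(refined, key=energy)  # the energy acts as the verifier


# Knob 1: more descent steps on the same starting point.
shallow = energy(refine(0.5, steps=1))
deeper = energy(refine(0.5, steps=5))
deepest = energy(refine(0.5, steps=50))

# Knob 2: more candidates, selected by energy (same seed, so the first candidate is shared).
single = energy(think(num_candidates=1, steps=50))
best_of_8 = energy(think(num_candidates=8, steps=50))
```

In this toy, a single descent from a poor starting point can settle in the shallower basin, which is why scoring several refined candidates with the same energy can help. No external verifier or reward appears anywhere in the loop.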

This challenges the implicit assumption in the related note "Can non-reasoning models catch up with more compute?": EBTs are not "reasoning models" in the RL-trained sense, yet they scale with inference compute, because energy minimization is itself a form of iterative refinement that does not require explicit reasoning traces.

Inquiring lines that read this note 46

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Which computational strategies best support reasoning in language models?

Can closed-form solutions compete with gradient descent optimization?

How effectively do deterministic tools improve language model reasoning on formal tasks?

Can surface heuristics override implicit constraints in domain-specific reasoning?

Do autonomous architecture discoveries follow predictable scaling laws?

Why do human-designed neural architectures eventually get replaced by learned ones?

Why do self-improving systems struggle without clear external performance metrics?

How much does domain shift limit the mechanisms a bilevel system can autonomously discover?

Does reinforcement learning teach reasoning or just when to reason?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

How does latent reasoning compare to verbalized chain-of-thought?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

How does example difficulty affect learning efficiency in language models?

Why do task-specific heuristics fail at generalizing to sparse data regions?

How do training data properties shape reasoning capability development?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Does policy entropy collapse limit how many iterations of reasoning training work?

Do base models contain latent reasoning that training can unlock?

How does sequence length affect sparsity tolerance in models?

Why do reasoning models fail at systematic problem-solving and search?

Can explicit optimal algorithms prevent reasoning model collapse at high complexity?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Why do recursive belief models require different training than logical derivation?

How does reasoning graph topology affect breakthrough insights and generalization?

What capability tradeoffs emerge when scaling model reasoning abilities?

How do soft continuous representations explore multiple reasoning paths simultaneously?

How does AI assistance affect human cognitive development and reasoning autonomy?

Why is metacognition neglected as a foundational AI research area?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Does decoupling reasoning reduce inference cost more than sequential scaling?

What limits mechanistic interpretability's ability to characterize models?

Can inference-time compute substitute for scaling up model parameters?

Why does architecture matter more than training compute for inference efficiency?

Related concepts in this collection 4

Can inference compute replace scaling up model size? Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
EBTs operationalize this at the architecture level: energy minimization inherently scales with inference compute
Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
EBTs may redefine the boundary: energy minimization is a form of inference-time computation that doesn't require reasoning-specific RL training
Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
EBTs add nuance: for energy-based architectures, more iterations genuinely improve until convergence, unlike token-based reasoning where overthinking degrades quality
Can recurrent hierarchies achieve reasoning that transformers cannot? Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
Complementary latent architecture: the Hierarchical Reasoning Model (HRM) achieves near-perfect accuracy via dual recurrence on tasks where CoT scores 0%, while EBTs achieve up to 35% higher scaling rate via energy minimization. The two use different mechanisms (recurrence vs. gradient descent) to escape the same TC0 constraint.
