Can energy minimization unlock reasoning without domain-specific training?
Can a gradient descent-based architecture achieve system 2 thinking across any modality or problem type using only unsupervised learning, without verifiers or reasoning-specific rewards?
Energy-Based Transformers (EBTs) represent a fundamentally different approach to inference-time scaling. Rather than emitting each prediction directly in one forward pass, EBTs are trained to assign an energy value (an unnormalized score, where lower energy means a more compatible pair) to every input and candidate-prediction pair. Prediction is then reframed as gradient descent-based energy minimization until convergence: starting from an initial candidate, the model iteratively refines its prediction by descending the energy landscape.
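The prediction loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the EBT implementation: `toy_energy` is a hypothetical hand-written stand-in for a learned energy function, the finite-difference gradient replaces the automatic differentiation a real model would use, and `step_size`, `max_steps`, and `tol` are assumed values chosen for this toy problem.

```python
from typing import Callable, List, Tuple

EnergyFn = Callable[[float, float], float]


def toy_energy(x: float, y: float) -> float:
    """Hypothetical stand-in for a learned energy: low when y is close to 2*x + 1."""
    return (y - (2.0 * x + 1.0)) ** 2


def energy_grad_y(energy: EnergyFn, x: float, y: float, eps: float = 1e-5) -> float:
    """Central finite-difference estimate of dE/dy (a real model would use autodiff)."""
    return (energy(x, y + eps) - energy(x, y - eps)) / (2.0 * eps)


def predict(
    energy: EnergyFn,
    x: float,
    y_init: float = 0.0,
    step_size: float = 0.1,
    max_steps: int = 500,
    tol: float = 1e-8,
) -> Tuple[float, List[float]]:
    """Refine a candidate prediction y by gradient descent on the energy.

    Returns the final candidate and the energy recorded after every step.
    """
    y = y_init
    trajectory = [energy(x, y)]
    for _ in range(max_steps):
        y = y - step_size * energy_grad_y(energy, x, y)
        trajectory.append(energy(x, y))
        # Stop once the energy has effectively stopped decreasing.
        if abs(trajectory[-2] - trajectory[-1]) < tol:
            break
    return y, trajectory


if __name__ == "__main__":
    y_hat, energies = predict(toy_energy, x=1.0)
    print(f"prediction: {y_hat:.4f} after {len(energies) - 1} steps")
    print(f"energy: {energies[0]:.4f} -> {energies[-1]:.2e}")
```

The input `x` stays fixed while only the candidate `y` is updated, which is the sense in which prediction becomes optimization over the output rather than a single forward pass.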
This formulation enables System 2 Thinking to emerge from unsupervised learning without any of the domain-specific scaffolding that current approaches require:
- No modality restrictions (works on both text and images)
- No problem-specific design (not limited to verifiable domains like math/code)
- No additional supervision beyond unsupervised pretraining (no verifiers, no verifiable rewards)
The scaling results are striking:
- Training: Up to 35% higher scaling rate than Transformer++ with respect to data, batch size, parameters, FLOPs, and depth
- Inference: 29% more improvement from additional test-time compute on language tasks than Transformer++
- Generalization: Larger performance improvements on data farther out-of-distribution — suggesting EBTs generalize better than existing approaches
- Efficiency: Outperform Diffusion Transformers on image denoising with fewer forward passes
The deeper implication: current test-time scaling approaches depend on either (a) verbalized reasoning chains, which require domain-specific training data, or (b) verifiable reward signals for RL-based training. EBTs bypass both constraints by making "thinking harder" an inherent property of the architecture: running more gradient descent iterations at inference amounts to more thinking, and the model's own energy function serves as the implicit verifier.
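The two ideas in this paragraph, a larger step budget as "more thinking" and the energy function as an implicit verifier, can be illustrated with a toy example. Everything here is an assumption made for illustration: `toy_energy` is a hypothetical non-convex function standing in for a learned energy at a fixed input, and the candidate list, step size, and budgets are arbitrary. It is a sketch of the general pattern, not the procedure used in the EBT paper.

```python
from typing import List, Tuple


def toy_energy(y: float) -> float:
    """Hypothetical non-convex energy: shallow minimum near y = 0.96, deeper one near y = -1.04."""
    return (y * y - 1.0) ** 2 + 0.3 * y


def toy_energy_grad(y: float) -> float:
    """Analytic derivative of toy_energy with respect to y."""
    return 4.0 * y * (y * y - 1.0) + 0.3


def refine(y: float, steps: int, step_size: float = 0.05) -> float:
    """Spend `steps` gradient descent iterations refining one candidate."""
    for _ in range(steps):
        y -= step_size * toy_energy_grad(y)
    return y


def think(candidates: List[float], steps: int) -> Tuple[float, float]:
    """Refine every candidate, then let the energy pick the winner.

    No external verifier is consulted: the lowest-energy candidate is kept.
    """
    refined = [refine(y, steps) for y in candidates]
    best = min(refined, key=toy_energy)
    return best, toy_energy(best)


INITIAL_CANDIDATES = [-1.5, -0.5, 0.5, 1.5]

if __name__ == "__main__":
    for budget in (1, 5, 50):
        y, e = think(INITIAL_CANDIDATES, budget)
        print(f"steps={budget:3d}  best y={y:+.3f}  energy={e:+.4f}")
```

Candidates that start on the positive side settle into the shallow minimum, so selecting by energy across several starting points is what recovers the deeper one in this toy setup.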
This challenges the implicit assumption in the related note "Can non-reasoning models catch up with more compute?". EBTs are not "reasoning models" in the RL-trained sense, yet they scale with inference compute, because energy minimization is itself a form of iterative refinement that does not require explicit reasoning traces.
Inquiring lines that use this note as a source (43)
This note is a source for these synthesized inquiries.
- Can closed-form solutions compete with gradient descent optimization?
- Can surface heuristics override implicit constraints in domain-specific reasoning?
- Why do human-designed neural architectures eventually get replaced by learned ones?
- How much does domain shift limit the mechanisms a bilevel system can autonomously discover?
- Can energy minimization replace reasoning-specific reinforcement learning for system 2 thinking?
- Can latent recurrence and energy minimization both escape the same computational depth constraints?
- How do gradient descent iterations at inference compare to chain-of-thought reasoning chains?
- What inductive bias would force models to learn Newtonian mechanics instead of shortcuts?
- Why do task-specific heuristics fail at generalizing to sparse data regions?
- Can gradient approximation at equilibrium replace backpropagation through time in practice?
- Can we transfer reasoning structure without copying surface form?
- Does policy entropy collapse limit how many iterations of reasoning training work?
- Can targeted activation steering surface latent reasoning in base models?
- How does factoring perception from reasoning improve sparse-label learning?
- Can explicit optimal algorithms prevent reasoning model collapse at high complexity?
- Why do recursive belief models require different training than logical derivation?
- How do beam search and MCTS traverse reasoning topologies?
- Do higher asymptote recipes unlock genuinely novel reasoning strategies?
- How does soft thinking compare to sampling multiple independent reasoning paths?
- Can a single architecture represent both physical and mental possibility spaces?
- What makes thought identifiability provable without auxiliary training data?
- Why is metacognition neglected as a foundational AI research area?
- Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?
- Why do reasoning models fail to improve constrained optimization performance?
- Do base models contain latent reasoning that minimal training can unlock?
- How does policy initialization with sub-policies enable emergent thinking?
- Can one training example activate mathematical reasoning without reinforcement learning?
- Does decoupling reasoning reduce inference cost more than sequential scaling?
- Can models reason at inference without specialized internal training?
- How do classical mechanics and statistical mechanics provide methodological templates for learning theory?
- How do soft token mixtures enable parallel reasoning exploration without explicit training?
- Can deterministic recurrent depth achieve the computational benefits of stochastic reasoning?
- How much training data is truly necessary to unlock latent model reasoning?
- Can reasoning happen in latent space without chain of thought?
- Can energy-based transformers achieve deep reasoning without supervision?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
- Can distillation from stronger models create genuinely new reasoning abilities?
- How can neural networks be interpretable by design rather than post-hoc?
- How do compact latent dynamics enable planning without explicit chain of thought?
- Can minimal training signals unlock latent reasoning capability in base models?
- Can minimal training signals unlock reasoning already latent in pretrained representations?
- Can small demonstration sets unlock general reasoning without large question data?
- Why does architecture matter more than training compute for inference efficiency?
Related concepts in this collection (4)
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  EBTs operationalize this at the architecture level: energy minimization inherently scales with inference compute.
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  EBTs may redefine the boundary: energy minimization is a form of inference-time computation that doesn't require reasoning-specific RL training.
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  EBTs add nuance: for energy-based architectures, more iterations genuinely improve predictions until convergence, unlike token-based reasoning, where overthinking degrades quality.
- Can recurrent hierarchies achieve reasoning that transformers cannot?
  Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.
  Complementary latent architecture: HRM achieves near-perfect accuracy on tasks where CoT scores 0% via dual recurrence; EBTs achieve an up to 35% higher scaling rate via energy minimization. The two use different mechanisms (recurrence vs. gradient descent) to escape the same TC0 constraint.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Energy-Based Transformers are Scalable Learners and Thinkers
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Hierarchical Reasoning Model
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Can Large Language Models Reason and Optimize Under Constraints?
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
- A Mechanistic Analysis of Looped Reasoning Language Models
Original note title
energy-based transformers achieve system 2 thinking from unsupervised learning alone — modality and problem agnostic