Why do energy-based models generalize better on out-of-distribution data than standard transformers?
First, an honest flag: the corpus has no notes on energy-based models, so I can't speak to that comparison directly. What it does have is a sharp account of *why transformers themselves struggle out-of-distribution*, which is the more answerable half of your question. Treat what follows as "here's what makes OOD hard for transformers," which is the territory any energy-based comparison would have to beat.
The deepest finding is that transformers often look like they're reasoning when they're actually pattern-matching. One study shows compositional reasoning in transformers collapses into "linearized subgraph matching": the model memorizes computation paths it saw in training and stitches them together, succeeding in-distribution but failing badly on novel combinations, with errors compounding across steps (see "Do transformers actually learn systematic compositional reasoning?"). A companion result probes models trained on orbital mechanics and games and finds they build task-specific heuristics, not unified world models; arithmetic, for instance, runs on "range-matching" tricks rather than an actual algorithm (see "Do foundation models learn world models or task-specific shortcuts?"). If your internal representation is a bag of shortcuts tuned to the training slice, distribution shift is exactly where it breaks.
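To make the "bag of shortcuts" point concrete, here is a minimal toy sketch. It is not the circuit found in the cited study: `train_lookup` and `shortcut_predict` are hypothetical names, and the nearest-neighbor fallback is an assumed stand-in for a memorization-style heuristic. It shows a predictor that is perfect on the training slice and fails as soon as the operands leave it.

```python
from itertools import product


def train_lookup(pairs):
    """Memorize the answer for every training pair (a stand-in for a learned shortcut)."""
    return {(a, b): a + b for a, b in pairs}


def shortcut_predict(table, a, b):
    """Answer from memory; for unseen inputs, reuse the nearest memorized pair."""
    if (a, b) in table:
        return table[(a, b)]
    # Assumption: "nearest" means smallest Manhattan distance between operand pairs.
    nearest = min(table, key=lambda k: abs(k[0] - a) + abs(k[1] - b))
    return table[nearest]


def accuracy(predict, pairs):
    """Fraction of pairs where the prediction equals the true sum."""
    return sum(predict(a, b) == a + b for a, b in pairs) / len(pairs)


# Training slice: single-digit operands. Shifted slice: operands from 10 to 19.
train_pairs = list(product(range(10), repeat=2))
ood_pairs = list(product(range(10, 20), repeat=2))
table = train_lookup(train_pairs)

in_dist_acc = accuracy(lambda a, b: shortcut_predict(table, a, b), train_pairs)
ood_acc = accuracy(lambda a, b: shortcut_predict(table, a, b), ood_pairs)
print(f"in-distribution accuracy: {in_dist_acc}, shifted accuracy: {ood_acc}")
```

On the shifted slice every query falls back to the memorized pair (9, 9), so the predictor keeps answering 18 while the true sums start at 20. An actual addition algorithm would not care which slice the operands came from.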
There's a subtler reason the breakage is hard to see coming. One note shows that a model can carry every linearly-decodable feature a task needs, scoring perfectly on standard evals, while its internal organization is "fractured," leaving it quietly vulnerable to perturbation and distribution shift that the metrics never flag (see "Can models be smart without organized internal structure?"). So good in-distribution accuracy isn't evidence of OOD robustness; the two can fully decouple. And on genuine constrained-optimization tasks, LLMs plateau around 55–60% constraint satisfaction regardless of scale, architecture, or training regime, which reads as a structural ceiling rather than a problem more parameters would fix (see "Do larger language models solve constrained optimization better?").
The one place the corpus shows transformers *winning* OOD is instructive about what it takes: a self-improving setup gets them from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining. That earns exponential length generalization, but only through an external correctness signal and an iterative loop, not from the architecture alone (see "Can transformers improve exponentially by learning from their own correct solutions?"). That's the hidden punchline for your question: where transformers do generalize out-of-distribution, it tends to come from an added training procedure or a verifier, not from the base model spontaneously extrapolating.
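The generate, filter, retrain loop described above can be sketched roughly as follows. Everything here is an assumption made for illustration: `toy_generate` is a hypothetical stand-in for a trained model (exact within its trained length, increasingly unreliable beyond it), and "retraining" is reduced to bumping a length counter. The sketch shows only the control flow. Its growth rate is fixed by the hypothetical `step` parameter, so it does not reproduce the exponential result.

```python
import random


def toy_generate(a, b, trained_len, rng):
    """Hypothetical model: exact within trained_len digits, degrading beyond it."""
    n = max(len(str(a)), len(str(b)))
    if n <= trained_len:
        return a + b
    # Assumption: success probability halves for each digit past the trained length.
    p = 0.5 ** (n - trained_len)
    return a + b if rng.random() < p else a + b + 1


def self_improve(trained_len, rounds, step, samples, rng):
    """Run generate -> filter -> retrain rounds; return the trained length per round."""
    history = [trained_len]
    for _ in range(rounds):
        target = trained_len + step
        kept = []
        for _ in range(samples):
            # Generate: attempt problems slightly longer than the trained length.
            a = rng.randrange(10 ** (target - 1), 10 ** target)
            b = rng.randrange(10 ** (target - 1), 10 ** target)
            guess = toy_generate(a, b, trained_len, rng)
            # Filter: the external correctness signal keeps only verified answers.
            if guess == a + b:
                kept.append((a, b, guess))
        # "Retrain": a real system would fine-tune on `kept`; this toy only
        # extends the trained length when at least one verified example exists.
        if kept:
            trained_len = target
        history.append(trained_len)
    return history


history = self_improve(trained_len=10, rounds=5, step=1, samples=50,
                       rng=random.Random(0))
print(history)
```

The design point is that the filter sits outside the model. Remove the `guess == a + b` check and the loop would retrain on its own mistakes, which is why the improvement is credited to the procedure rather than to the architecture.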
What you'd want next — and what isn't here — is the energy-based side: the claim that learning an energy landscape over inputs (rather than a feed-forward map) lets a model evaluate and reject configurations it never saw. The corpus can tell you *why the transformer baseline is weak OOD* but can't yet tell you *why an EBM beats it*. If that's the real target, this is a gap worth flagging for the collection rather than one I can paper over.
Sources (5 notes)
"Do transformers actually learn systematic compositional reasoning?": Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
"Do foundation models learn world models or task-specific shortcuts?": Inductive bias probes show transformers trained on orbital mechanics and games learn predictive patterns, not unified world structure. Fine-tuning reveals nonsensical, slice-dependent laws; circuit analysis shows arithmetic relies on range-matching heuristics, not algorithms.
"Can models be smart without organized internal structure?": Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
"Do larger language models solve constrained optimization better?": Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
"Can transformers improve exponentially by learning from their own correct solutions?": Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining, showing exponential (not linear) out-of-distribution improvement across rounds without saturation.