Can a single architecture represent both physical and mental possibility spaces?
This explores whether one reasoning architecture can model both kinds of "what could be" — the physical (how a maze, a Sudoku board, or a world-state could resolve) and the mental (the spread of strategies, beliefs, and uncertain readings a solver holds before committing). The corpus suggests the answer is converging toward yes, but only once you stop treating prediction as a single deterministic forward pass.
The strongest single-architecture candidate is the energy-based framing. Energy-Based Transformers assign an energy value to every input-prediction pair and reach an answer by gradient descent on that landscape at inference time (see "Can energy minimization unlock reasoning without domain-specific training?"). That matters here because an energy landscape is naturally a possibility space: low-energy basins are the configurations the model finds plausible, whether those configurations describe a physical layout or a candidate line of reasoning. The same minimization machinery walks both. And because the model learns this from unsupervised data without domain-specific scaffolding, and generalizes better out-of-distribution, it isn't quietly two systems wearing one coat.
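To make "inference as descent on an energy landscape" concrete, here is a minimal sketch. It is not the Energy-Based Transformer itself: the energy function is a hand-written toy (a hypothetical double well, chosen so that there are two plausible answers), and the names `energy`, `energy_grad_y`, and `infer` are illustrative assumptions rather than anything from the cited work.

```python
def energy(x: float, y: float) -> float:
    """Toy energy (an assumption, not a learned model).

    It is zero exactly when y == x or y == -x, so the landscape
    has two low-energy basins: two "plausible" predictions for x.
    """
    return (y * y - x * x) ** 2


def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the prediction y."""
    return 4.0 * y * (y * y - x * x)


def infer(x: float, y0: float, step: float = 0.01, n_steps: int = 500) -> float:
    """Refine an initial guess y0 by plain gradient descent on the energy."""
    y = y0
    for _ in range(n_steps):
        y -= step * energy_grad_y(x, y)
    return y


# Different starting guesses settle into different basins of the same landscape.
upper = infer(1.0, y0=0.5)
lower = infer(1.0, y0=-0.5)
```

The point of the toy is only that one minimization routine serves every basin: nothing in `infer` knows whether a basin stands for a board configuration or a line of reasoning.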
The physical side shows up most clearly in the Hierarchical Reasoning Model, which couples slow abstract planning with fast detailed computation and solves Sudoku and mazes nearly perfectly where chain-of-thought collapses (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). Crucially, it does this by escaping the fixed-depth ceiling of standard transformers, which implies that representing rich physical possibility is less about scale than about having enough effective computational depth to simulate state forward. The mental side is supplied by GRAM, which swaps deterministic latent updates for stochastic ones so a recursive reasoner can hold a *distribution* over solutions and keep several valid strategies alive at once (see "Can stochastic latent reasoning help models explore multiple solutions?"). Put those two together and you see the shape of a single answer: depth gives you forward simulation of physical states, and stochastic latent transitions give you the branching mental space of alternatives over those states.
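The two ingredients in that last sentence, effective depth and stochastic latent transitions, can be illustrated with a toy recurrent refiner. This is a sketch under loud assumptions: it reproduces neither HRM's two-timescale modules nor GRAM's actual update rule. The function names, the double-well objective, the noise scale, and the noise-free `settle` phase are all hypothetical choices made for illustration.

```python
import random


def step(z: float, x: float, lr: float = 0.01) -> float:
    """One weight-tied refinement step on the toy energy (z^2 - x^2)^2."""
    return z - lr * 4.0 * z * (z * z - x * x)


def refine(x: float, z0: float = 0.1, depth: int = 200, noise: float = 0.0,
           settle: int = 1000, seed: int = 0) -> float:
    """Apply the same step `depth` times, optionally with Gaussian latent noise.

    noise == 0.0 gives a deterministic reasoner: one start, one answer.
    noise > 0.0 gives a stochastic one: different seeds can reach different
    valid answers. The `settle` phase (an assumption of this sketch) runs
    noise-free steps so each sample commits to a single basin.
    """
    rng = random.Random(seed)
    z = z0
    for _ in range(depth):
        z = step(z, x)
        if noise:
            z += noise * rng.gauss(0.0, 1.0)
    for _ in range(settle):
        z = step(z, x)
    return z


# Depth: a shallow unroll has not reached a solution yet, a deep one has.
shallow = refine(1.0, depth=5, settle=0)
deep = refine(1.0, depth=500, settle=0)

# Stochasticity: the same start point yields a spread over both solutions.
samples = [refine(1.0, noise=0.1, seed=s) for s in range(40)]
```

In this toy, `shallow` is still far from either solution while `deep` has converged, and the noisy `samples` split between the two basins at +1 and -1, whereas the deterministic run from the same start can only ever report one of them.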
The interesting tension is whether one undivided network *should* carry both, and here the corpus pushes back. A recurring finding is that separating planning from execution beats monolithic models: splitting a decomposer from a solver improves accuracy, and the decomposition skill transfers across domains while the solving skill doesn't (see "Does separating planning from execution improve reasoning accuracy?"). Reasoning architectures more broadly seem to want the decision of *when* to reason decoupled from the capability to execute that reasoning (see "How should reasoning systems actually be architected?"), and abstractions that force breadth-first exploration outperform raw depth (see "Can abstractions guide exploration better than depth alone?"). So "single architecture" may be the wrong frame: the mental possibility space (which abstractions, which decomposition) and the physical one (executing a concrete state transition) keep wanting to live in different modules even when they share weights.
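The decomposer/solver split can be sketched in a few lines. This is a deliberately tiny illustration, not the setup from the cited note: `decompose`, `solve`, and the two example domains are hypothetical, and the "solver" here is just a combining function supplied by the caller. Inputs are assumed to be non-empty.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def decompose(items: Sequence[T]) -> List[Sequence[T]]:
    """Domain-agnostic planner: split a task in half, or leave it atomic."""
    if len(items) <= 1:
        return [items]
    mid = len(items) // 2
    return [items[:mid], items[mid:]]


def solve(items: Sequence[T], combine: Callable[[T, T], T]) -> T:
    """Run the shared plan, delegating each concrete step to a domain solver."""
    parts = decompose(items)
    if len(parts) == 1:
        return parts[0][0]  # atomic subtask: nothing left to plan
    left, right = (solve(part, combine) for part in parts)
    return combine(left, right)


# The same decomposer serves two domains; only the solver changes.
total = solve([1, 2, 3, 4], lambda a, b: a + b)
joined = solve(["a", "b", "c"], lambda a, b: a + b)
```

Here `decompose` never inspects what the items are, which is the toy analogue of a decomposition skill that transfers across domains, while everything domain-specific sits in the `combine` argument.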
The deepest doorway is a skeptical one. There's an argument that computation never represents a possibility space on its own: it presupposes an experiencing mapmaker who already carved continuous physics into discrete symbols, and no amount of added complexity conjures that agent (see "Can computation arise without a conscious mapmaker?"). Read against the question, this is a warning that "physical" and "mental" possibility may not be symmetric: a model can simulate physical states it was given symbols for, but the *mental* act of deciding which possibilities are even worth representing might be something the architecture inherits from us rather than generates. Whether you find that limiting or liberating is exactly the thing you didn't know you wanted to think about.
Sources (7 notes)
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
Research shows RL post-training teaches models *when* to use reasoning mechanisms that pre-training already provides. Decoupled architectures, latent reasoning in continuous space, and interleaved action-grounding all outperform monolithic chain-of-thought approaches.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Computational systems depend on a conscious mapmaker who alphabetizes continuous physics into discrete symbols. No increase in algorithmic complexity can generate this agent; it must logically precede the computation it makes possible.