Can removing hierarchy from dual-recurrence models improve reasoning performance?
This explores whether the two-timescale 'hierarchy' in models like the Hierarchical Reasoning Model is actually doing the work — or whether the recursion alone is what improves reasoning, so hierarchy could be dropped without loss.
The corpus has a surprisingly direct answer, and it leans toward yes. The Hierarchical Reasoning Model (HRM) made the original case for hierarchy. It couples a slow abstract-planning loop with a fast detailed-computation loop across two timescales and reaches near-perfect Sudoku and maze performance with just 27M parameters, on tasks where chain-of-thought collapses, by escaping the fixed-depth complexity ceiling that limits transformers (Can recurrent hierarchies achieve reasoning that transformers cannot?). The natural reading is that the two-timescale structure is what buys the extra effective depth.
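To make the two-timescale coupling concrete, here is a minimal sketch. It is not HRM's actual architecture, which the notes do not spell out: the update rules `fast_step` and `slow_step`, their weights, and the cycle counts are all hypothetical, chosen only to show a fast state being updated many times per slow update.

```python
import math

def fast_step(z_fast, z_slow, x):
    # Hypothetical fast update: mixes the fast state, the slow "plan", and the input.
    return [math.tanh(0.5 * f + 0.3 * s + 0.2 * xi)
            for f, s, xi in zip(z_fast, z_slow, x)]

def slow_step(z_slow, z_fast):
    # Hypothetical slow update: absorbs the result of the last fast cycle.
    return [math.tanh(0.7 * s + 0.3 * f) for s, f in zip(z_slow, z_fast)]

def two_timescale_forward(x, n_slow_cycles=4, fast_steps_per_cycle=8):
    """Run a slow loop that wraps a fast loop; return the slow state and
    the total number of fast updates performed."""
    z_fast = [0.0] * len(x)
    z_slow = [0.0] * len(x)
    fast_updates = 0
    for _ in range(n_slow_cycles):
        for _ in range(fast_steps_per_cycle):
            z_fast = fast_step(z_fast, z_slow, x)
            fast_updates += 1
        # The slow state changes only once per fast cycle.
        z_slow = slow_step(z_slow, z_fast)
    return z_slow, fast_updates

plan, depth = two_timescale_forward([1.0, -1.0])
```

In this toy version the number of sequential updates is `n_slow_cycles * fast_steps_per_cycle`, so it grows with the loop counts while the update rules themselves stay fixed. That is the sense in which recurrence adds effective depth without adding parameters.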
But a follow-up directly tests that assumption and finds the hierarchy is not the active ingredient. A 7M-parameter, two-layer network that simply recurses on its own latent reasoning state reaches 45% on ARC-AGI-1, beating billion-parameter LLMs with a fraction of a percent of their parameters. The authors attribute the gain to recursion itself, not to scale and not to hierarchical architecture (Can tiny recursive networks outperform massive language models?). In other words, strip the hierarchy, keep the recursive refinement of a latent state, and reasoning performance holds or improves. That is a direct test of exactly the component your question asks about.
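For contrast, a hierarchy-free recursion can be sketched as a single shared update applied repeatedly to one latent state. This is an illustrative toy, not the tiny recursive network from the follow-up: `refine`, its weights, and the step counts are assumptions made for the example.

```python
import math

def refine(z, x, w_z=0.6, w_x=0.4):
    # Hypothetical shared update: the same function is reused at every step.
    return [math.tanh(w_z * zi + w_x * xi) for zi, xi in zip(z, x)]

def recursive_forward(x, n_steps=16):
    """Refine a single latent state n_steps times with one shared update."""
    z = [0.0] * len(x)
    for _ in range(n_steps):
        z = refine(z, x)
    return z

def residual(z, x):
    # How much one more refinement step would still change the state.
    return max(abs(a - b) for a, b in zip(refine(z, x), z))

x = [1.0, -0.5]
shallow = recursive_forward(x, n_steps=2)
deep = recursive_forward(x, n_steps=16)
```

There is only one state and one update rule here, with no slow/fast split. In this toy setup, more recursion steps leave a smaller residual, meaning the latent state is closer to a point where further refinement changes nothing.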
The broader pattern in the corpus is that gains people credit to elaborate architecture often trace to one specific mechanism instead. GRAM makes the same kind of correction for stochastic recursive reasoning: adding randomness alone does nothing, and the improvement comes specifically from amortized variational inference, which couples the stochastic latent to a principled generative objective, not from the surface feature people assume (Does adding randomness alone improve recursive reasoning models?). So the lesson isn't 'simpler is always better,' it's 'isolate which part actually moves the needle.' For dual-recurrence, that part appears to be the recursive latent computation, not the hierarchy stacked on top.
There's a complementary thread suggesting why less structure can help: a lot of reasoning machinery is wasted or actively harmful. Memoryless 'Markov-style' reasoning, in which each state depends only on the current problem and not on accumulated history, maintains answer quality while shedding the historical baggage that bloats reasoning (Can reasoning systems forget history without losing coherence?). Dynamic pruning removes up to 75% of reasoning steps, chiefly verification and backtracking steps that downstream computation barely attends to, without hurting accuracy (Can reasoning steps be dynamically pruned without losing accuracy?). And reasoning models fail less from too little compute than from structural disorganization: wandering and premature path-switching, which decoding-level penalties on thought-switching mitigate without fine-tuning (Why do reasoning models abandon promising solution paths?). Across these, extra structure is repeatedly shown to be removable or counterproductive.
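The pruning idea above can be illustrated with a small sketch that keeps only the reasoning steps receiving the most downstream attention. The per-step attention scores, the example steps, and the `keep_fraction` parameter are hypothetical; the notes do not describe the actual selection procedure in this detail.

```python
def prune_steps(steps, attention, keep_fraction=0.25):
    """Keep the highest-attention steps, preserving their original order.

    `attention` holds one hypothetical score per step: how much later
    computation attends to that step.
    """
    if len(steps) != len(attention):
        raise ValueError("need exactly one attention score per step")
    n_keep = max(1, round(len(steps) * keep_fraction))
    # Rank step indices by attention, highest first.
    ranked = sorted(range(len(steps)), key=lambda i: attention[i], reverse=True)
    # Restore chronological order among the kept steps.
    kept = sorted(ranked[:n_keep])
    return [steps[i] for i in kept]

steps = ["set up equation", "verify setup", "solve for x", "backtrack",
         "re-verify", "substitute", "double-check", "state answer"]
attention = [0.30, 0.02, 0.25, 0.01, 0.02, 0.20, 0.03, 0.17]
pruned = prune_steps(steps, attention, keep_fraction=0.5)
```

With these made-up scores, the verification and backtracking steps fall away and the four constructive steps remain. The default `keep_fraction=0.25` corresponds to the 75% reduction mentioned above.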
The honest caveat: the corpus has exactly one head-to-head comparison that removes the hierarchy specifically (the tiny recursive network vs. HRM), so this is a strong signal rather than a settled result, and it is measured on puzzle benchmarks like ARC and Sudoku, not on open-ended reasoning. These puzzle wins may also rest on shakier ground than they look: reasoning success often reflects fitting instance-level patterns rather than learning a general algorithm (Do language models fail at reasoning due to complexity or novelty?). So 'removing hierarchy helps' and 'the whole class is solving these the way we hope' are two separate questions. The first looks true on current evidence; the second is still open.
Sources (7 notes)
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
A 7M-parameter two-layer network recursing on its latent reasoning state reached 45% on ARC-AGI-1, beating larger LLMs with 0.01% of their parameters. The gains come from recursion itself, not scale or hierarchical architecture.
GRAM's ablations show naive stochasticity added to existing models yields no improvement. Gains come specifically from amortized variational inference, which couples stochastic latents to a principled generative objective rather than injecting undirected noise.
Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.