Can any architecture fundamentally solve problems that require inherently sequential computation?
This explores whether the limits on solving inherently sequential problems are about the architecture you pick (transformers vs. recurrent vs. something new) or about a deeper computational barrier no design escapes — and what the corpus says about routing around it.
This question is really two questions wearing one coat: is there a hard wall around problems that demand step-by-step computation, and if so, can any architecture climb over it? The corpus's sharpest answer comes from complexity theory. The serial scaling hypothesis argues that some problems genuinely require polynomial-depth reasoning, and that parallel architectures like the Transformer cannot solve them no matter how large you scale — width and parameters don't substitute for depth Can parallel architectures solve inherently sequential problems?. So the honest framing is: the wall is real, but it's a wall against a *class* of architectures (fixed-depth, parallel), not against computation itself.
The escape route the corpus keeps returning to is recurrence — buying serial depth back by computing in time rather than in layers. The Hierarchical Reasoning Model couples slow abstract planning with fast detailed steps across two timescales, and with only 27M parameters solves Sudoku and mazes where chain-of-thought collapses entirely, explicitly breaking the AC0/TC0 ceiling that traps fixed-depth transformers Can recurrent hierarchies achieve reasoning that transformers cannot?. Chain-of-thought itself is a softer version of the same trick: on compositional tasks like graph connectivity it earns an *exponential* advantage over parallel voting, precisely because the answer requires accumulating intermediate results in order, which short parallel chains can't fake When does sequential reasoning beat parallel voting?. The pattern: you don't beat sequential problems by changing the substrate, you beat them by letting the system take more genuine steps.
Here's the surprise the corpus hands you. A single finite-size transformer is already Turing complete — there exists a fixed network that can compute any computable function given the right prompt Can a single transformer become universally programmable through prompts?. So in principle the architecture is not the bottleneck at all; the bottleneck is whether training ever teaches the model to *use* that latent universality. That reframes the whole debate: reasoning models persistently beat non-reasoning ones at equal inference budget not because they have more raw capacity, but because training installs a protocol that makes each extra step productive Can non-reasoning models catch up with more compute?.
The most counterintuitive corner is that you can often dodge the depth wall by re-shaping the problem rather than deepening the model. MAKER decomposes million-step tasks into minimal subtasks with voting at each step and runs them to zero errors — using *small, non-reasoning* models, inverting the assumption that hard sequential problems need bigger brains Can extreme task decomposition enable reliable execution at million-step scale?. Recursive subtask trees with cache pruning sustain reasoning past context limits Can recursive subtask trees overcome context window limits?, and Atom of Thoughts goes further, making each state depend only on the current contracted subproblem — a memoryless, Markov-style chain that keeps coherence without dragging history Can reasoning systems forget history without losing coherence?. Separating the planner from the solver helps too, since the decomposition skill generalizes across domains even when solving doesn't Does separating planning from execution improve reasoning accuracy?.
The sobering counterweight: capability in principle isn't competence in practice. Frontier reasoning models score only 20–23% on constraint-satisfaction problems that demand sustained backtracking, meaning fluent-looking reflection doesn't translate to actually solving unfamiliar sequential structure Can reasoning models actually sustain long-chain reflection?. So the full answer: no architecture *evades* the need for sequential computation, but several — recurrence, decomposition, prompted universality — supply it. The frontier isn't inventing an architecture that cheats the wall; it's getting models to reliably spend the serial steps they're already capable of taking.
Sources 10 notes
Complexity theory proves that problems requiring polynomial-depth reasoning cannot be solved by parallel architectures like Transformers, even with infinite scaling. Progress requires recurrent structures that increase serial computation depth.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.