INQUIRING LINE

Do decoder-only models have inherent architectural limits for non-sequential information?

This explores whether the left-to-right, causal design of decoder-only models structurally prevents them from handling information that isn't inherently ordered — and the corpus suggests the honest answer is 'partly, but the limit is more removable than it looks.'


Read as a question about whether causal, left-to-right processing is a hard wall against non-sequential information, the corpus gives three answers that are worth holding together. The most direct evidence says yes, there is a real architectural tax: causal attention lets each token attend only to itself and earlier tokens, which hurts these models when they are asked to build a holistic representation of a whole input. The note "Why do decoder-only models underperform as text encoders?" pins the bottleneck precisely: it is the causal masking, not model size, that makes decoder-only models weak text encoders. But the same note is also the escape hatch. Per its summary, LLM2Vec enables bidirectional attention and then adapts the model with masked prediction and contrastive learning, which turns those same models into state-of-the-art encoders. So the 'limit' is a configuration choice baked into how the model reads, not an irreversible property of the weights.
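The difference between the two masks can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: it uses scalar "token values" and uniform attention scores, and the helper names (`attention_mask`, `attend`) are hypothetical, not taken from LLM2Vec or any library. It shows that under a causal mask the first position's output cannot change when a later token changes, while under a bidirectional mask every position sees the whole input.

```python
import math


def attention_mask(n, causal):
    """Return an n x n boolean mask; mask[i][j] is True if position i may attend to j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(values, scores, mask):
    """Toy single-head attention over scalar values with precomputed scores."""
    out = []
    for i, row in enumerate(scores):
        visible = [j for j in range(len(values)) if mask[i][j]]
        weights = softmax([row[j] for j in visible])
        out.append(sum(w * values[j] for w, j in zip(weights, visible)))
    return out


# Assumed toy input: four scalar "tokens" and uniform (all-zero) scores.
n = 4
scores = [[0.0] * n for _ in range(n)]
original = [1.0, 2.0, 3.0, 4.0]
edited = [1.0, 2.0, 3.0, 40.0]  # only the LAST token differs

for causal in (True, False):
    mask = attention_mask(n, causal)
    before = attend(original, scores, mask)
    after = attend(edited, scores, mask)
    label = "causal" if causal else "bidirectional"
    # Under the causal mask, position 0 is unaffected by the edit to position 3.
    print(label, "position 0 changed:", before[0] != after[0])
```

In this toy setting only the final position of the causally masked model ever reflects the full input, which is one way to read the note's claim that the mask, rather than model size, limits holistic encoding.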

There's a deeper, less negotiable version of the limit when the task is reasoning rather than encoding. "Can recurrent hierarchies achieve reasoning that transformers cannot?" shows that fixed-depth transformers sit under a genuine complexity ceiling (the AC0/TC0 classes), which is why chain-of-thought collapses on tightly interdependent puzzles like Sudoku and mazes: the answer can't be assembled in one left-to-right sweep but requires holding the whole board in mind and iterating. A recurrent, hierarchical model escapes that ceiling with only 27M parameters and 1,000 samples. That's the strongest case that something about the standard architecture, not just its training, struggles with information whose structure is simultaneous rather than sequential.
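A loose analogy may help show why a fixed number of processing rounds can fall short on maze-like problems. The sketch below is not the Hierarchical Reasoning Model and says nothing formal about AC0/TC0; it is a plain reachability propagation over a hypothetical 5x5 maze, with an assumed function name (`solve_reachable`). Each round extends the known-reachable region by one step, so a budget fixed in advance fails whenever the maze needs more rounds than the budget allows, while iterating until nothing changes always finishes the job.

```python
def solve_reachable(grid, start, goal, max_rounds=None):
    """Propagate reachability outward from start, one step per round.

    max_rounds=None iterates to a fixed point (input-dependent depth);
    an integer caps the number of rounds (fixed depth).
    Assumes the start cell is open ('.').
    """
    rows, cols = len(grid), len(grid[0])
    reached = {start}
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        frontier = set()
        for r, c in reached:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and grid[nr][nc] == "." and (nr, nc) not in reached:
                    frontier.add((nr, nc))
        if not frontier:  # fixed point: nothing new was learned this round
            break
        reached |= frontier
        rounds += 1
    return goal in reached


# Hypothetical snaking maze: '.' is open, '#' is a wall.
maze = [
    ".....",
    "####.",
    ".....",
    ".####",
    ".....",
]
start, goal = (0, 0), (4, 4)

print("4 rounds:      ", solve_reachable(maze, start, goal, max_rounds=4))
print("until settled: ", solve_reachable(maze, start, goal))
```

The point of the analogy is only that the required number of rounds depends on the input, so any depth chosen ahead of time can be defeated by a longer corridor.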

And yet the corpus refuses to let 'inherent' stand unchallenged. "Can a single transformer become universally programmable through prompts?" reports a proof that a single finite-size transformer exists that can compute any computable function given the right prompt, meaning the architecture is not formally bounded in what it can compute. The catch is that standard training rarely produces models that actually implement such programs. This reframes the whole question: the wall you hit in practice is usually a training-regime wall wearing an architecture costume. "Can non-reasoning models catch up with more compute?" echoes the idea: what looks like a capability ceiling turns out to be about how the model was trained to deploy its compute.

The field, meanwhile, is routing around the sequential bias rather than waiting for a new architecture. "Can neural memory modules scale language models beyond attention limits?" bolts a separate long-term memory onto attention so the model isn't forced to re-derive everything from a linear sweep, and "Is long-context bottleneck really about memory or compute?" argues the real long-context constraint is the compute needed to consolidate scattered context into internal state, which is again a processing limit, not a storage one. Step back further and "Are text-only language models fundamentally limited by abstraction?" suggests the most stubborn 'non-sequential' problem isn't even attention shape: text itself has already flattened the geometry, physics, and causal structure of the world into a symbol stream, so the model never receives the non-sequential information in the first place.

The thing you didn't know you wanted to know: 'decoder-only' bundles together at least three separable limits — a removable encoding limit (causal masking), a real but escapable reasoning-depth limit (fixed-depth complexity classes), and a non-limit in principle that becomes a limit only through how we train. The architecture is less destiny than it appears; what's sequential is mostly the habit, not the hardware.


Sources (7 notes)

Why do decoder-only models underperform as text encoders?

LLM2Vec's unsupervised 3-step process (bidirectional attention + masked prediction + contrastive learning) achieves SOTA on MTEB. The research shows causal masking, not model size, is the representation bottleneck in decoder-only encoders.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether decoder-only models have inherent architectural limits for non-sequential information. The question remains open; treat findings below as dated claims, not current truth.

What a curated library found — and when (2024–2026):
- Causal masking in decoder-only models cripples holistic text encoding, but switching to bidirectional attention recovers state-of-the-art performance; the 'limit' is a configuration choice, not intrinsic (2024).
- Fixed-depth transformers sit under a genuine complexity ceiling (AC0/TC0 class); chain-of-thought fails on tightly interdependent puzzles like Sudoku without recurrence or hierarchy; a 27M hierarchical recurrent model escapes this (2025–2026).
- A finite transformer can compute any computable function given the right prompt (Turing completeness); standard training, not architecture, is the bottleneck in practice (2024).
- Long-context limits stem from compute needed to consolidate scattered context into internal state, not from sequential-read geometry (2025).
- Text itself is a lossy flattening of non-sequential structure; the model may never receive the raw non-sequential information at all (2024).

Anchor papers (verify; mind their dates):
- arXiv:2404.05961 (LLM2Vec, Apr 2024) — decoder-only as bidirectional encoder
- arXiv:2411.01992 (Turing completeness of prompting, Nov 2024) — formal bounds
- arXiv:2506.21734 (Hierarchical Reasoning Model, Jun 2025) — complexity class escape
- arXiv:2501.00663 (Titans, Dec 2024) — adaptive memory modules

Your task:
(1) RE-TEST EACH CONSTRAINT. For the causal-masking limit: has fine-tuning, in-context learning, or new attention variants (e.g., block-sparse, mixture-of-experts masking) since mid-2025 actually *relaxed* the need to switch to bidirectional? For the reasoning-depth limit: do newer models (o1-style or post-2025 variants) escape AC0/TC0 via training signal, or do they still require explicit recurrence? For the training-vs-architecture gap: cite what resolved it — new loss functions, curriculum, scaling laws, or architectural hybrids.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Has any paper shown decoder-only models solving Sudoku or mazes without recurrence? Has any shown bidirectional attention unnecessary for long-context after all?
(3) Propose 2 research questions that ASSUME the regime has moved: e.g., does test-time compute (scaling inference) now let fixed-depth decoders handle fully non-sequential tasks? Does multimodal pretraining (arXiv:2603.03276) dissolve the text-flattening problem?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
