INQUIRING LINE

How does error propagation limit transformer performance on complex tasks?

This explores why small per-step mistakes compound into total failure when transformers tackle long multi-step tasks — and what the corpus says about where that fragility comes from and how to engineer around it.


This is really a question about *compounding*: a transformer can be 99% reliable on a single reasoning step and still fail almost certainly on a long chain, because per-step success probabilities multiply, so the chance of a fully correct chain shrinks exponentially with its length. The sharpest account of where this fragility originates is the finding that transformers don't learn compositional *rules* at all: they succeed in-distribution by memorizing and matching computation subgraphs from training data, then fail drastically on novel compositions, with errors compounding across reasoning steps ("Do transformers actually learn systematic compositional reasoning?"). So the model isn't executing a robust algorithm that stays correct as the chain grows; it's pattern-matching locally, and every weak link is a place for the chain to break.
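
The arithmetic behind that claim is easy to check. The sketch below is a minimal illustration, assuming every step succeeds independently with the same probability (a simplification; real errors are often correlated). The function names are hypothetical and chosen only for this example.

```python
import math


def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that every one of n_steps succeeds, assuming independent steps."""
    return p_step ** n_steps


def max_reliable_length(p_step: float, target: float = 0.5) -> int:
    """Longest chain whose overall success probability is still at least `target`."""
    return math.floor(math.log(target) / math.log(p_step))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"{n:>5} steps at 99% per step -> {chain_success(0.99, n):.4f}")
    print("longest chain with >= 50% success:", max_reliable_length(0.99))
```

Under these assumptions, a 99%-reliable step gives roughly 90% success at 10 steps, about 37% at 100 steps, and well under 0.01% at 1,000 steps; the chain drops below even odds after 68 steps.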

That fragility has architectural roots, not just data roots. Fixed-depth transformers sit under a complexity ceiling (the AC0/TC0 circuit classes) that caps how much sequential computation they can do in one forward pass, which is why models can ace short problems but collapse on tasks like Sudoku or long mazes that demand many genuinely dependent steps ("Can recurrent hierarchies achieve reasoning that transformers cannot?"). When the task's true depth exceeds what the architecture can compute internally, the model improvises, and improvisation at step *k* poisons every step after it. The way transformers build multi-hop ability reinforces this: it emerges in fragile developmental stages, and the second hop only generalizes if the model saw explicit compositional examples in training ("How do transformers learn to reason across multiple steps?"). Absent that exposure, the second hop is exactly where propagation begins.

The most striking thing the corpus offers is that error propagation is *engineerable away*, and the fix inverts intuition. MAKER solves million-step tasks with zero errors not by using a smarter model but by decomposing the problem into minimal subtasks, voting at each step, and flagging correlated errors so they can't cascade; remarkably, small non-reasoning models suffice once the decomposition is extreme enough ("Can extreme task decomposition enable reliable execution at million-step scale?"). The lesson is that the limit isn't raw capability; it's that a single long generation gives errors nowhere to be caught. Break the chain into independently verified links and the compounding curve flattens.
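
To see why per-step voting flattens the curve, here is a minimal sketch using plain majority voting. It is not MAKER's actual voting rule or its correlated-error flagging, which the notes here do not specify. It assumes samples are independent (exactly what correlated errors would violate) and, pessimistically, that all wrong samples agree on the same wrong answer. The names `majority_correct` and `voted_chain_success` are hypothetical.

```python
from math import comb


def majority_correct(p: float, votes: int) -> float:
    """Probability that a strict majority of `votes` independent samples is correct.

    Pessimistic binary model: all wrong samples are assumed to agree, so the
    step fails whenever correct samples are not a strict majority.
    Use an odd number of votes to avoid ties.
    """
    need = votes // 2 + 1
    return sum(
        comb(votes, k) * p**k * (1 - p) ** (votes - k)
        for k in range(need, votes + 1)
    )


def voted_chain_success(p: float, votes: int, n_steps: int) -> float:
    """Probability that every voted step in an n_steps chain comes out correct."""
    return majority_correct(p, votes) ** n_steps


if __name__ == "__main__":
    for votes in (1, 3, 5, 7):
        print(f"{votes} vote(s): 1000-step success = {voted_chain_success(0.99, votes, 1000):.4f}")
```

In this toy model a single 99%-reliable sample almost never survives 1,000 steps, while five votes per step push the chain's success probability above 98%. The gain comes entirely from catching errors locally, not from a better base model.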

A gentler version of the same idea is self-correction through filtering. Transformers can generalize from 10-digit to 100-digit addition by repeatedly generating solutions, keeping only the correct ones, and retraining, turning what would be runaway error growth into exponential *improvement* across rounds ("Can transformers improve exponentially by learning from their own correct solutions?"). This approach and MAKER share a structural insight: insert a correctness filter between steps and you convert error propagation from a death spiral into something bounded.
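
The shape of that generate, filter, retrain loop can be sketched in a few lines. This is an illustration of the filtering step only, assuming an exact correctness check is available. The "model" is a hypothetical noisy adder standing in for a real transformer, the retraining step is left as a comment, and nothing here reproduces the cited work's results.

```python
import random


def noisy_add(problem: str, rng: random.Random, error_rate: float = 0.3) -> str:
    """Hypothetical stand-in for a model: adds correctly, but sometimes emits a wrong sum."""
    a, b = map(int, problem.split("+"))
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.randint(1, 9)  # a nonzero offset always makes the answer wrong
    return str(answer)


def is_correct(problem: str, answer: str) -> bool:
    """Correctness filter: verify the proposed sum exactly."""
    a, b = map(int, problem.split("+"))
    return int(answer) == a + b


def self_improvement_round(problems, solve, train_set):
    """Generate one solution per problem and keep only the verified ones."""
    candidates = [(p, solve(p)) for p in problems]
    kept = [(p, ans) for p, ans in candidates if is_correct(p, ans)]
    return train_set + kept, len(candidates) - len(kept)


rng = random.Random(0)
train_set = []
for round_idx in range(3):
    digits = 2 + round_idx  # slightly longer operands each round
    problems = [
        f"{rng.randrange(10**digits)}+{rng.randrange(10**digits)}" for _ in range(20)
    ]
    train_set, rejected = self_improvement_round(
        problems, lambda p: noisy_add(p, rng), train_set
    )
    # A real pipeline would retrain the model on train_set here (omitted).
    print(f"round {round_idx}: kept {len(train_set)} total, rejected {rejected}")
```

Because wrong answers are dropped before they reach the training set, the model's own mistakes never become training signal, which is the property that keeps errors from feeding forward across rounds.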

There's also a subtler form of self-inflicted error worth knowing about. Models trained to hide their reasoning actually compute the right answer in early layers, then overwrite it with format-compliant filler in later layers ("Do transformers hide reasoning before producing filler tokens?"), so a correct intermediate result can be destroyed before it's ever emitted. This reframes "error propagation" as not only mistakes that accumulate forward but also correct signal that gets suppressed along the way, and it suggests that how we train models to present reasoning can itself manufacture the failures we then blame on the chain's length.
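
The analysis behind this finding, the logit lens, amounts to projecting each layer's hidden state through the output (unembedding) matrix and reading off the token ranking. The toy below shows only the mechanics: the vocabulary, matrix, and hidden states are made-up numbers chosen for illustration, not measurements from any model, and the final layer normalization that a real logit lens applies is omitted.

```python
VOCAB = ["7", "...", "the"]

# Toy unembedding matrix, one row per vocabulary token (hypothetical values).
UNEMBED = [
    [1.0, 0.0],  # "7": the answer token in this toy
    [0.0, 1.0],  # "...": a filler token
    [0.1, 0.1],  # "the": an unrelated token
]

# Toy hidden states after each layer (made-up numbers, not real activations).
HIDDEN_BY_LAYER = [
    [2.0, 0.1],  # layer 1
    [1.5, 1.0],  # layer 2
    [0.6, 2.5],  # layer 3 (final)
]


def logit_lens(hidden):
    """Rank vocabulary tokens by the logits obtained from one layer's hidden state."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]
    order = sorted(range(len(VOCAB)), key=lambda i: logits[i], reverse=True)
    return [VOCAB[i] for i in order]


if __name__ == "__main__":
    for layer, hidden in enumerate(HIDDEN_BY_LAYER, start=1):
        print(f"layer {layer}: ranking = {logit_lens(hidden)}")
```

In this toy the answer token ranks first in the early layers and second at the final layer, behind the filler token: suppressed in the output but still recoverable from the lower-ranked predictions.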


Sources (6 notes)

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about error propagation in transformers against the latest evidence. The question remains open: *How does error propagation limit transformer performance on complex tasks?*

What a curated library found — and when (dated claims, not current truth):
Findings span May 2023 to November 2025.
• Transformers succeed via linearized subgraph matching, not compositional rule learning; errors cascade across reasoning chains because each step inherits weak links from prior steps (2023–2024).
• Fixed-depth transformers sit under AC0/TC0 complexity bounds, capping sequential computation depth; tasks demanding many dependent steps (Sudoku, long mazes) expose this ceiling (2025).
• Error propagation is engineerable away via extreme task decomposition + per-step voting; million-step tasks achievable with zero errors using non-reasoning microagents (2025).
• Self-correction through iterative generation + filtering converts runaway error growth into exponential improvement across rounds (2025).
• Transformers compute correct answers in early layers, then overwrite them with format-compliant output in later layers—correct signal suppressed before emission (2024).

Anchor papers (verify; mind their dates):
• arXiv:2305.18654 (Faith and Fate, 2023) — compositionality limits.
• arXiv:2412.04537 (Understanding Hidden Computations, 2024) — layer-wise reasoning suppression.
• arXiv:2511.09030 (Solving a Million-Step Task, 2025) — MAKER decomposition method.
• arXiv:2502.01612 (Self-Improving Transformers, 2025) — iterative correction scaling.

Your task:
(1) RE-TEST EACH CONSTRAINT. For compositionality, subgraph matching, and AC0/TC0 bounds, ask: have newer scaling laws, mixture-of-experts routing, or chain-of-thought variants *actually dissolved* these limits, or merely masked them? Separate the durable question (can transformers learn true algorithms?) from perishable claims (current models fail at X steps). Cite what changed the picture.
(2) Surface the strongest *contradiction*: does the MAKER/voting result (zero errors possible) genuinely contradict the AC0/TC0 ceiling, or do they operate in different regimes? Find recent work (last 6 months) that directly challenges the "errors compound" narrative.
(3) Propose 2 new research questions assuming the regime *has* shifted: (a) If decomposition + filtering fully decouple error from depth, what is the *real* bottleneck—search space, compute, or sampling efficiency? (b) If early layers compute correctly but later layers suppress, is retraining the *readout* cheaper than retraining the model?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
