INQUIRING LINE

Why does explicit chain-of-thought work as a workaround for feedforward transformers?

This explores why writing reasoning out as tokens (chain-of-thought) compensates for something the transformer's fixed-depth, feedforward architecture can't do on its own.

The cleanest answer in the corpus is structural: a feedforward transformer has no native recurrent state to carry an evolving computation forward, so it must push that state deeper into its stack of layers, and a network has only so many layers. Once depth runs out, the computation has nowhere to go. Chain-of-thought sidesteps this by spilling the intermediate state into the text the model has just generated, then reading it back in on the next pass. The token stream becomes external memory, and each generated step buys a fresh trip through the full depth of the network. In other words, CoT isn't reasoning the architecture wants to do; it's a workaround for a structural deficiency in what a fixed-depth stack can compute in a single pass (see "Why do transformers need explicit chain-of-thought reasoning?").
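As a toy illustration of the depth-budget argument (not a model of any real transformer), the sketch below treats one forward pass as a function that can apply at most `DEPTH` sequential state updates. The names `DEPTH`, `forward_pass`, `answer_in_one_pass`, and `answer_with_cot` are hypothetical, and running parity stands in for any computation that needs state carried forward.

```python
from typing import List, Tuple

DEPTH = 4  # assumed layer budget: sequential state updates available per pass


def forward_pass(state: int, bits: List[int], depth: int = DEPTH) -> Tuple[int, List[int]]:
    """One fixed-depth pass: fold at most `depth` input bits into the running parity."""
    for b in bits[:depth]:      # each iteration stands in for one layer
        state ^= b
    return state, bits[depth:]  # updated state plus the input left unread


def answer_in_one_pass(bits: List[int]) -> int:
    """No scratchpad: anything beyond the depth budget is never processed."""
    state, _unread = forward_pass(0, bits)
    return state


def answer_with_cot(bits: List[int]) -> Tuple[int, List[int]]:
    """Scratchpad: write the intermediate state out, then read it back on the next pass."""
    scratchpad = [0]            # the "token stream" acting as external memory
    remaining = list(bits)
    while remaining:
        state, remaining = forward_pass(scratchpad[-1], remaining)
        scratchpad.append(state)  # each written step buys another full-depth pass
    return scratchpad[-1], scratchpad
```

With five input bits and a budget of four, the single pass stops one update short of the true parity, while the scratchpad version takes two passes and finishes the computation.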

There's direct mechanistic evidence for the 'running out of depth' story. When models are trained to hide their reasoning, logit-lens analysis shows them computing the correct answer in the earliest layers (layers 1-3), then actively suppressing those representations in the final layers to emit format-compliant filler; the real work is done early and remains recoverable from lower-ranked token predictions (see "Do transformers hide reasoning before producing filler tokens?"). That's the same depth budget at work: the network does what it can within its layers, and externalizing to tokens is how it extends the budget. The complexity-theory framing makes the ceiling explicit: fixed-depth transformers fall within shallow circuit classes such as AC0 or TC0, and a small recurrent model (the 27M-parameter Hierarchical Reasoning Model) that loops on its own latent state solves Sudoku and mazes where chain-of-thought collapses, precisely because recurrence gives it the effective depth CoT has to fake one token at a time (see "Can recurrent hierarchies achieve reasoning that transformers cannot?").
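For readers unfamiliar with the logit lens, here is a minimal sketch of the idea: project each layer's hidden state through the unembedding matrix and see which tokens rank highest at that depth. The data is hand-built to mimic the pattern described above, not taken from any model, and the sketch omits the final layer normalization that real implementations usually apply. `logit_lens`, `UNEMBED`, and `HIDDEN` are hypothetical names.

```python
from typing import Dict, List


def logit_lens(hidden_by_layer: List[List[float]],
               unembed: Dict[str, List[float]]) -> List[List[str]]:
    """For each layer, rank vocabulary tokens by dot product with that layer's hidden state."""
    rankings = []
    for h in hidden_by_layer:
        logits = {tok: sum(w * x for w, x in zip(vec, h)) for tok, vec in unembed.items()}
        rankings.append(sorted(logits, key=logits.get, reverse=True))
    return rankings


# Hand-built toy data (an assumption for illustration only).
UNEMBED = {
    "7": [1.0, 0.0, 0.0],    # the "answer" token direction
    "...": [0.0, 1.0, 0.0],  # the filler token direction
    "the": [0.0, 0.0, 1.0],  # an unrelated token
}
HIDDEN = [
    [2.0, 0.1, 0.3],  # early layer: answer direction dominates
    [1.5, 1.0, 0.2],  # middle layer: filler direction growing
    [1.0, 3.0, 0.1],  # final layer: filler wins, answer still present
]

RANKINGS = logit_lens(HIDDEN, UNEMBED)
```

In this constructed example the answer token is top-ranked at the first layer and drops to second place at the last layer, which is the shape of the "recoverable from lower-ranked predictions" finding.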

But 'workaround' cuts two ways, and here the corpus pushes back on taking CoT at face value. Several notes argue that what looks like reasoning is closer to imitation: CoT constrains the model to replay reasoning-shaped patterns it saw in training, and performance degrades predictably under distribution shift, which is the signature of pattern matching, not inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Even more bluntly, the *form* of the chain matters more than its content: training format shapes strategy far more than domain, and logically invalid CoT prompts can work as well as valid ones (see "What makes chain-of-thought reasoning actually work?"). And when transformers are probed on genuine composition, they turn out to be matching memorized computation subgraphs rather than applying systematic rules, failing hard on novel combinations (see "Do transformers actually learn systematic compositional reasoning?"). So CoT works as a workaround not because it unlocks new reasoning, but because it lets the model stage familiar pattern-completion across many forward passes instead of one.

The workaround is also leaky and has costs. Lipschitz analysis shows longer chains dampen but never eliminate the model's sensitivity to input noise; there's a structural robustness floor you can't reason your way past (see "Can longer reasoning chains eliminate model sensitivity to input noise?"). And longer isn't free: accuracy follows an inverted U in chain length, with the optimal length growing with task difficulty and shrinking as models get more capable, so a stronger model needs less of the crutch (see "Why does chain of thought accuracy eventually decline with length?"). That last point is the tell that CoT is compensatory: the better the underlying computation, the less external scaffolding it needs.
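One way to see how an inverted U can arise is a toy model (an assumption for illustration, not the analysis in the cited note): extra steps make it more likely the chain covers the problem, but every step is another chance to slip. The functions `accuracy` and `best_length`, and the parameters `difficulty`, `capability`, and `step_error`, are all hypothetical.

```python
import math


def accuracy(n_steps: int, difficulty: float, capability: float,
             step_error: float = 0.02) -> float:
    """Toy accuracy: chance the chain covers the task times chance no step goes wrong."""
    # Coverage saturates with length; capable models cover more per step.
    coverage = 1.0 - math.exp(-capability * n_steps / difficulty)
    # Every additional step must also be error-free.
    no_slip = (1.0 - step_error) ** n_steps
    return coverage * no_slip


def best_length(difficulty: float, capability: float, max_steps: int = 200) -> int:
    """Chain length in 1..max_steps that maximizes the toy accuracy."""
    return max(range(1, max_steps + 1),
               key=lambda n: accuracy(n, difficulty, capability))
```

Under these assumptions the best length rises when `difficulty` goes up and falls when `capability` goes up, matching the qualitative pattern described above; the specific numbers carry no empirical weight.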

If you want to see where this is heading, the interesting frontier is making the workaround unnecessary or internalizing it. One line tries to plant reasoning earlier by treating CoT as an exploratory action during pretraining, rewarded by information gain (see "Can chain-of-thought reasoning be learned during pretraining itself?"); another teaches models to internalize search algorithms by training on linearized algorithm traces, so the search happens inside the model rather than on the page (see "Can models learn to internalize search algorithms through training?"). The cautionary note is that moving reasoning *into* latent space is hard: outcome-only supervision starves the gradients along latent steps and lets the latent state drift, which is part of why externalizing to readable tokens remains the path of least resistance for now (see "Why does latent chain-of-thought fail so easily in training?").
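The information-gain reward can be sketched in a few lines, going only on the source note's description of it as a log-likelihood improvement: the thought is rewarded by how much it raises the log-probability of the observed next token. This is a simplified reading, not the published method's exact formulation, and `information_gain_reward` and its arguments are hypothetical names.

```python
import math


def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the observed next token from conditioning on a thought.

    Both arguments are probabilities in (0, 1] that a model assigns to the same
    observed token, with and without the generated chain-of-thought in context.
    """
    return math.log(p_with_thought) - math.log(p_without_thought)
```

A thought that doubles the probability of the observed token earns a reward of log 2, and a thought that makes the prediction worse earns a negative reward, so no external verifier is needed to score it.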

Sources (11 notes)

Why do transformers need explicit chain-of-thought reasoning?

Feedforward transformers lack native recurrent state-tracking and must push evolving state deeper into layers, eventually exhausting depth. Explicit chain-of-thought externalizes this state into tokens as a costly patch for a structural deficiency.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can models learn to internalize search algorithms through training?

Meta-CoT demonstrates that instruction-tuning on linearized MCTS and A* traces teaches models to implement search strategies internally. This enables optimization over algorithms themselves rather than specific outputs, potentially unlocking novel reasoning strategies.

Why does latent chain-of-thought fail so easily in training?

Outcome supervision alone causes gradient attenuation along latent steps and lets the latent space wander without semantic grounding. Robust latent reasoning requires both dense trajectory supervision and space supervision that preserves geometric structure rather than compressing it.
