Do sparse arithmetic circuits explain all language model reasoning abilities?
This reads the question as: can reasoning in language models be reduced to one clean internal mechanism — neat, sparse computational circuits — or does the corpus show something messier underneath?
The short answer from this collection is no, not cleanly: reasoning in language models does not reduce to a single tidy internal mechanism of sparse, circuit-like arithmetic. The corpus doesn't treat 'reasoning' as one thing with one location. It treats it as a bundle of different behaviors that show up in different places inside the model, sometimes contradicting each other. The first warning sign comes from work showing that identical model outputs can hide radically different internal structures (What actually happens inside a language model?): the same answer can be reached by entirely different internal routes. If a single sparse circuit explained reasoning, you'd expect the internals to converge. They don't.
There is a real strand here that points toward sparsity, which is probably what the question is reaching for. On hard, unfamiliar tasks, models systematically thin out their activations: hidden states become sparser in a localized way that tracks task difficulty, and the sparsification acts as a selective filter that stabilizes performance rather than as a sign of breakdown (Do language models sparsify their activations under difficult tasks?). Inside a chain of reasoning, likelihood-preserving pruning shows that models implicitly rank tokens by function: the symbolic-computation tokens (the actual 'arithmetic') are preferentially preserved, while grammar and filler are pruned first (Which tokens in reasoning chains actually matter most?). So there's a genuine kernel of structured, sparse computation in there.
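To make "sparser hidden states" concrete, here is a minimal sketch of two ways one might score the sparsity of an activation vector. The choice of metrics (Hoyer sparsity and a near-zero fraction), the `eps` threshold, and the toy vectors are all assumptions for illustration; the source notes do not say which measure the underlying work used.

```python
import math

def hoyer_sparsity(vec):
    """Hoyer sparsity in [0, 1]: 0 for a uniform-magnitude vector, 1 for a one-hot vector."""
    n = len(vec)
    if n < 2:
        raise ValueError("need at least two dimensions")
    l1 = sum(abs(x) for x in vec)
    l2 = math.sqrt(sum(x * x for x in vec))
    if l2 == 0.0:
        return 0.0  # convention chosen here for the all-zero vector (an assumption)
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def near_zero_fraction(vec, eps=1e-3):
    """Fraction of entries whose magnitude is below eps (eps is a hypothetical threshold)."""
    return sum(1 for x in vec if abs(x) < eps) / len(vec)

# Toy activation vectors, invented for illustration only.
dense = [0.5, -0.5, 0.5, -0.5]   # every unit equally active
sparse = [0.0, 2.0, 0.0, 0.0]    # a single active unit

for name, vec in [("dense", dense), ("sparse", sparse)]:
    print(name, hoyer_sparsity(vec), near_zero_fraction(vec))
```

In practice one would compute a score like this per layer and per input, then compare familiar and unfamiliar tasks. A magnitude-based measure such as Hoyer sparsity avoids having to pick a hard threshold.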
But that kernel doesn't carry the whole load. When researchers strip the familiar semantics out of a reasoning task and leave only the logical structure, performance collapses, even when the correct rules are sitting right there in context (Do large language models reason symbolically or semantically?). That's the opposite of what a robust arithmetic circuit would predict: a real circuit shouldn't care whether the variables are named 'apple' or 'X'. Relatedly, reasoning failures don't cluster at complexity thresholds, where a general algorithm would be expected to break; they cluster at instance-novelty boundaries, suggesting models are matching against patterns they've seen rather than running a general procedure (Do language models fail at reasoning due to complexity or novelty?).
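The 'apple' versus 'X' contrast can be illustrated with a small sketch that rewrites a task so only its logical skeleton remains. This is an assumed, simplified construction, not the procedure used in the cited work; the function name `abstract_terms` and the example sentences are hypothetical.

```python
import re

def abstract_terms(text, terms):
    """Replace each content term with an abstract symbol (X1, X2, ...), consistently."""
    if not terms:
        return text, {}
    mapping = {term: f"X{i}" for i, term in enumerate(terms, start=1)}
    # Try longer terms first so a longer term wins over any shorter term it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

semantic = "Every apple is a fruit. Every fruit is edible. Therefore every apple is edible."
symbolic, mapping = abstract_terms(semantic, ["apple", "fruit", "edible"])
print(symbolic)
```

Both versions have the same logical form, so a purely symbolic reasoner should handle them equally well. A gap in accuracy between the two is the kind of evidence the paragraph above describes.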
The most surprising twist against the clean-circuit picture is where the computation even lives. In models trained with hidden chain-of-thought, the correct answer is computed in the earliest layers (layers 1-3 under logit lens analysis), then actively suppressed in the final layers to produce format-compliant filler, with the real reasoning still recoverable from lower-ranked predictions (Do transformers hide reasoning before producing filler tokens?). Reasoning here isn't one localized circuit firing cleanly to an output; it's an early computation that gets suppressed and disguised on its way out.
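The logit-lens idea behind this finding can be sketched in a few lines: project each layer's hidden state onto the vocabulary and track the rank of the correct answer token. Everything numeric below (the three-token vocabulary, the unembedding rows, the per-layer hidden states) is invented to mirror the described pattern and is not data from the study. Real implementations commonly also apply the model's final normalization before unembedding, which this toy omits.

```python
def logits(hidden, unembed):
    """Project a hidden state onto the vocabulary: one dot product per token row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]

def token_rank(hidden, unembed, token_id):
    """Rank of token_id among the vocabulary logits (0 = top prediction)."""
    scores = logits(hidden, unembed)
    return sum(1 for s in scores if s > scores[token_id])

def top_token(hidden, unembed):
    """Index of the highest-scoring token for this hidden state."""
    scores = logits(hidden, unembed)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical toy setup: token "7" is the answer, "..." is the filler token.
VOCAB = ["7", "...", "the"]
UNEMBED = [
    [1.0, 0.0],    # "7"
    [0.0, 1.0],    # "..."
    [-1.0, -1.0],  # "the"
]
HIDDEN_BY_LAYER = [
    [2.0, 0.1],  # early layer: the answer direction dominates
    [1.5, 1.0],  # middle layer: filler direction growing
    [1.0, 3.0],  # final layer: the filler direction dominates
]

ANSWER = VOCAB.index("7")
ranks = [token_rank(h, UNEMBED, ANSWER) for h in HIDDEN_BY_LAYER]
print("answer rank by layer:", ranks)
print("final top token:", VOCAB[top_token(HIDDEN_BY_LAYER[-1], UNEMBED)])
```

In this toy the answer is the top prediction in the early layers and drops to second place at the output, where the filler token wins. That is the sense in which the reasoning remains recoverable from lower-ranked predictions.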
So the takeaway is this: there is sparse, structured, almost arithmetic-like machinery inside these models, but it sits alongside semantic shortcutting, instance memorization, and layer-to-layer overwriting. 'Reasoning' is a label for several mechanisms that don't reduce to one. Sparse circuits are part of the story but not the whole story, and the corpus's most interesting finding is how much of reasoning is hidden, semantic, and pattern-bound rather than symbolic and clean.
Sources (6 notes)
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.