How much of chain-of-thought reasoning is actually redundant?
This explores how much of a chain-of-thought (CoT) trace does real computational work versus just filling space — and what the corpus says about cutting the slack.
The corpus is surprisingly blunt about how much of a chain-of-thought trace is padding that could be removed without hurting accuracy: a lot of it. Chain of Draft matches the accuracy of standard CoT on arithmetic, symbolic, and commonsense tasks while using only 7.6% of the tokens, meaning roughly 92% of a normal reasoning trace served style and documentation, not computation (Can minimal reasoning chains match full explanations?). Other methods converge on similar numbers from different angles: dynamically pruning low-attention steps removes about 75% of them while holding accuracy, because verification and backtracking steps are barely attended to downstream (Can reasoning steps be dynamically pruned without losing accuracy?), and probes that detect when a model has already 'committed' to an answer allow an early exit that cuts up to 80% of tokens (Does chain-of-thought reasoning reflect genuine thinking or performance?).
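The attention-based pruning idea can be illustrated with a minimal sketch. It assumes each reasoning step already comes with a single precomputed downstream-attention score; how such scores are actually derived is not described here, and `prune_steps`, `keep_fraction`, and the example data are hypothetical names and values chosen for illustration. The default `keep_fraction` of 0.25 simply mirrors the "about 75% removed" figure.

```python
import math
from typing import List


def prune_steps(steps: List[str], attention: List[float],
                keep_fraction: float = 0.25) -> List[str]:
    """Keep the highest-attention steps, preserving their original order.

    `attention` is an assumed, precomputed score per step (higher means the
    step is attended to more by later tokens). This is an illustrative
    selection rule, not the published method.
    """
    if len(steps) != len(attention):
        raise ValueError("steps and attention must have the same length")
    if not steps:
        return []
    # Always keep at least one step.
    n_keep = max(1, math.ceil(len(steps) * keep_fraction))
    # Rank by descending attention; ties go to the earlier step.
    ranked = sorted(range(len(steps)), key=lambda i: (-attention[i], i))
    kept = sorted(ranked[:n_keep])  # restore original order
    return [steps[i] for i in kept]


# Hypothetical trace: verification and backtracking steps get low scores.
trace = ["set up equation", "double-check units", "solve for x", "wait, recheck"]
scores = [0.9, 0.1, 0.8, 0.05]
pruned = prune_steps(trace, scores, keep_fraction=0.5)
```

With these made-up scores, the two low-attention steps (the unit check and the recheck) are the ones dropped, which is the pattern the pruning result describes.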
But the more interesting twist is that 'redundant' isn't a fixed fraction; it depends on how hard the task is. The same early-exit work shows CoT is performative on easy problems (the model knows the answer before it finishes writing out the reasoning) but genuinely tracks belief updates on hard ones (Does chain-of-thought reasoning reflect genuine thinking or performance?). That matches the inverted-U finding: accuracy peaks at an intermediate CoT length, and that optimal length rises with task difficulty but falls as the model gets more capable, with RL training naturally drifting toward shorter chains as models improve (Why does chain of thought accuracy eventually decline with length?). So redundancy is partly a symptom of mismatch: a strong model on an easy task is mostly producing ceremony.
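A rough sketch of probe-guided early exit makes the easy-versus-hard contrast concrete. It assumes a probe that emits one confidence value per reasoning step; the function names, the `threshold` and `patience` parameters, and the confidence sequences below are all hypothetical, chosen only to illustrate the idea.

```python
from typing import Optional, Sequence


def early_exit_step(confidences: Sequence[float], threshold: float = 0.9,
                    patience: int = 2) -> Optional[int]:
    """Return the index of the step at which to stop reasoning, or None.

    Stops once the (assumed) probe confidence has stayed at or above
    `threshold` for `patience` consecutive steps.
    """
    streak = 0
    for i, conf in enumerate(confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return i
    return None


def fraction_saved(step_lengths: Sequence[int], exit_step: Optional[int]) -> float:
    """Fraction of tokens skipped by exiting after `exit_step` (0.0 if no exit)."""
    total = sum(step_lengths)
    if exit_step is None or total == 0:
        return 0.0
    used = sum(step_lengths[: exit_step + 1])
    return 1 - used / total


# Made-up probe readings: an easy problem commits early, a hard one late.
easy = [0.95, 0.97, 0.98, 0.99, 0.99]
hard = [0.30, 0.50, 0.40, 0.92, 0.95]
lengths = [20, 20, 20, 20, 20]  # tokens per step, also made up

easy_exit = early_exit_step(easy)
hard_exit = early_exit_step(hard)
easy_saved = fraction_saved(lengths, easy_exit)
hard_saved = fraction_saved(lengths, hard_exit)
```

In this toy setup the easy trace exits after its second step and skips most of its tokens, while the hard trace runs to the end and saves nothing, which is the difficulty-dependent pattern described above.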
Not all the excess is harmless padding, though. A second category of waste is reasoning that actively goes nowhere: models 'underthink' by abandoning promising paths mid-exploration, and 'wander' through invalid branches like tourists rather than scientists. A simple decoding penalty on thought-switching tokens recovers accuracy without any retraining, which is evidence that the wasted tokens were structural disorganization, not necessary search (Do reasoning models switch between ideas too frequently?; Why do reasoning models abandon promising solution paths?). This is a different kind of redundancy from verbose phrasing: it's effort spent and then thrown away.
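A decoding-time penalty of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the published implementation: the token list, the `alpha` (penalty strength) and `beta` (number of tokens at the start of a thought during which the penalty applies) parameters, and the string-keyed logits are all hypothetical simplifications.

```python
from typing import Dict, Set


def penalize_switch_tokens(logits: Dict[str, float], switch_tokens: Set[str],
                           tokens_in_thought: int, alpha: float = 3.0,
                           beta: int = 50) -> Dict[str, float]:
    """Return a copy of `logits` with thought-switch tokens discouraged.

    While the current thought is younger than `beta` tokens, subtract `alpha`
    from the logit of every token in `switch_tokens`. Afterwards, leave the
    logits unchanged so the model can still change course.
    """
    if tokens_in_thought >= beta:
        return dict(logits)
    return {
        tok: (score - alpha if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


# Hypothetical next-token logits and switch-token list.
SWITCH = {"Alternatively", "Wait"}
raw = {"Alternatively": 2.0, "so": 1.5, "Wait": 0.5}

early = penalize_switch_tokens(raw, SWITCH, tokens_in_thought=10)
late = penalize_switch_tokens(raw, SWITCH, tokens_in_thought=60)
best_early = max(early, key=early.get)
best_late = max(late, key=late.get)
```

Early in a thought the penalty flips the top choice from the switch token to a continuation token; once the thought has run past `beta` tokens, the original preference is restored. No model weights are touched, which is why this counts as a decoding-only intervention.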
The redundancy question also quietly reframes what CoT *is*. If invalid reasoning prompts work as well as valid ones, and training format shapes reasoning strategy 7.5× more than domain does, then much of the chain isn't a computation you can trim line by line; it's pattern-guided generation where the structure, not the propositions, carries the load (What makes chain-of-thought reasoning actually work?; Does chain-of-thought reasoning reveal genuine inference or pattern matching?). Separate work argues that the tokens that *do* matter are a workaround for an architectural limitation: feedforward transformers lack recurrent state-tracking, so they externalize evolving state into text (Why do transformers need explicit chain-of-thought reasoning?). Put those together and 'how much is redundant' splits into two answers: most of the *prose* is removable, but the genuinely load-bearing tokens are themselves an inefficient patch for something the architecture can't do internally. This is worth pairing with the faithfulness critique, which warns that steps that look essential often aren't causally necessary at all (Do language models actually use their reasoning steps?).
Sources (10 notes)
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Feedforward transformers lack native recurrent state-tracking and must push evolving state deeper into layers, eventually exhausting depth. Explicit chain-of-thought externalizes this state into tokens as a costly patch for a structural deficiency.
LLM reasoning chains fail both causal sufficiency (steps don't always matter) and causal necessity (spurious steps are common). Research shows most CoT evaluation measures output quality, not whether reasoning actually caused the answer.