How much does chain-of-thought reasoning narrow the decompression gap?
This reads 'decompression gap' as the distance between what a model computes internally and what it actually works out by writing reasoning down step-by-step — and asks how much of that visible chain is doing real computational work versus just decorating an answer the model already had.
The corpus is surprisingly blunt about whether spelling out reasoning actually unpacks hidden computation: most of what looks like decompression isn't. When researchers strip a chain-of-thought down to its computational core, accuracy barely moves. Chain of Draft matches full verbose reasoning at 7.6% of the tokens, meaning roughly 92% of the tokens served style and documentation rather than thinking (see: Can minimal reasoning chains match full explanations?). A separate intervention study reaches the same place from a different angle: 75% of reasoning steps can be dynamically pruned with accuracy intact, because verification and backtracking steps receive almost no downstream attention from the model itself (see: Can reasoning steps be dynamically pruned without losing accuracy?). So the gap CoT narrows is thinner than its length suggests.
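The pruning idea can be sketched in a few lines. This is a minimal illustration, not the study's method: the `Step` record, the `downstream_attention` scores, and the `keep_fraction` parameter are all assumptions made up for the example, with 0.25 chosen to mirror the "75% pruned" figure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    kind: str                    # e.g. "derivation", "verification", "backtracking"
    downstream_attention: float  # hypothetical score: how much later tokens attend to this step


def prune_steps(steps: List[Step], keep_fraction: float = 0.25) -> List[Step]:
    """Keep the highest-attention steps, preserving their original order."""
    n_keep = max(1, round(len(steps) * keep_fraction))
    # Rank step indices by attention, highest first.
    ranked = sorted(range(len(steps)),
                    key=lambda i: steps[i].downstream_attention,
                    reverse=True)
    # Restore chain order among the survivors.
    kept = sorted(ranked[:n_keep])
    return [steps[i] for i in kept]


# Invented scores for illustration only.
chain = [
    Step("Let x be the number of apples.", "setup", 0.62),
    Step("Then 3x + 2 = 14.", "derivation", 0.91),
    Step("Wait, let me re-read the problem.", "verification", 0.04),
    Step("Yes, that matches the statement.", "verification", 0.03),
    Step("So 3x = 12 and x = 4.", "derivation", 0.88),
    Step("Hmm, maybe I should try x = 5.", "backtracking", 0.05),
    Step("No, 3*5 + 2 = 17, so go back.", "backtracking", 0.06),
    Step("Checking: 3*4 + 2 = 14.", "verification", 0.07),
]

pruned = prune_steps(chain)
for step in pruned:
    print(step.kind, "|", step.text)
```

With these invented scores only the two derivation steps survive, which is the pattern the note describes: the steps that later tokens ignore are the ones that can go.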
Why so thin? Two notes argue the mechanism is imitation, not inference: CoT guides the model to pattern-match the *shape* of reasoning rather than perform genuine logic, which is why prompts that merely look like reasoning, including ones whose steps are invalid, still 'work' and why format dominates content (see: Why does chain-of-thought reasoning fail in predictable ways?; What makes chain-of-thought reasoning actually work?). A shift-cipher decomposition makes this concrete. CoT performance there separates into three independent factors: raw output probability (which alone swings accuracy from 26% to 70%), memorization that tracks pre-training frequency, and a genuine-but-noisy reasoning channel that accumulates error at every step (see: What three separate factors drive chain-of-thought performance?). Only that third channel is true decompression; the other two are the model retrieving compressed answers, not unpacking them.
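To make the third channel tangible, here is a small sketch of the task and of why per-step noise compounds. The decoder is an ordinary shift cipher. The compounding formula assumes each step fails independently with the same probability, which is a simplifying assumption of this example and not a claim from the study.

```python
import string

ALPHABET = string.ascii_lowercase


def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a shift cipher by moving each lowercase letter back `shift` places."""
    out = []
    for ch in ciphertext:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) - shift) % 26])
        else:
            out.append(ch)  # leave spaces and punctuation untouched
    return "".join(out)


def chain_success_probability(per_step_accuracy: float, n_steps: int) -> float:
    """Chance that every step is right, ASSUMING independent, equal per-step accuracy."""
    return per_step_accuracy ** n_steps


print(shift_decode("uryyb", 13))  # one decoding step per letter
for n in (1, 5, 10, 20):
    print(n, round(chain_success_probability(0.95, n), 3))
```

Under that independence assumption, a step that is right 95% of the time yields a fully correct 10-step chain only about 60% of the time, which is the sense in which a genuine reasoning channel can be real and still lossy.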
The length of a chain turns out to be a poor proxy for how much real work is happening. Controlled maze experiments show trace length tracks problem difficulty only on familiar (in-distribution) problems and decouples entirely on novel ones; long traces often signal recall of a training schema, not adaptive computation (see: Does longer reasoning actually mean harder problems?). And more reasoning isn't monotonically better: accuracy follows an inverted U over chain length, and the optimal length grows with task difficulty but shrinks as models become more capable, with RL training pushing them toward *shorter* chains (see: Why does chain of thought accuracy eventually decline with length?). The decompression gap, in other words, can be narrowed by writing less, not more.
There's also a faithfulness trap worth knowing about: fine-tuning can make the gap *look* narrower while actually widening it. Three separate tests (early termination, paraphrasing, and filler substitution) show that fine-tuned models produce chains that less reliably determine their own answers. The reasoning becomes performative, a justification generated alongside an answer rather than the cause of it (see: Does fine-tuning disconnect reasoning steps from final answers?). This is the uncomfortable core of your question: a chain can narrow the *appearance* gap while the *computational* gap stays open.
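The early-termination test is the easiest of the three to sketch. The idea: cut the chain off at each point, ask for an answer, and count how often it differs from the full-chain answer. Everything named below (`early_termination_sensitivity`, `faithful_model`, `performative_model`) is a hypothetical stand-in written for this example; a real probe would call an actual model instead of these toy functions.

```python
from typing import Callable, List

AnswerFn = Callable[[str, List[str]], str]


def early_termination_sensitivity(answer_fn: AnswerFn,
                                  question: str,
                                  chain: List[str]) -> float:
    """Fraction of truncation points where the answer differs from the full-chain answer."""
    if not chain:
        return 0.0
    full_answer = answer_fn(question, chain)
    changed = sum(
        answer_fn(question, chain[:k]) != full_answer
        for k in range(len(chain))  # k = 0 means no reasoning at all
    )
    return changed / len(chain)


def faithful_model(question: str, chain: List[str]) -> str:
    # Toy model: reads its answer off the last step it was shown.
    return chain[-1].split("=")[-1].strip() if chain else "unknown"


def performative_model(question: str, chain: List[str]) -> str:
    # Toy model: the answer is fixed in advance; the chain is decoration.
    return "4"


question = "Solve 3x + 2 = 14."
chain = ["3x + 2 = 14", "3x = 12", "x = 4"]

faithful_score = early_termination_sensitivity(faithful_model, question, chain)
performative_score = early_termination_sensitivity(performative_model, question, chain)
print(faithful_score, performative_score)
```

A score near 1 means the answer depends on the chain; a score near 0 means truncating the reasoning changes nothing, which is the invariance the note reports becoming more common after fine-tuning.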
Where CoT genuinely closes distance is when steps are forced to carry real information rather than recall it. Interleaving reasoning with external tool calls (ReAct) injects fresh evidence at each step and beats pure CoT by 10 to 34 percentage points of absolute accuracy on knowledge-intensive and interactive tasks, precisely because grounding stops error from compounding (see: Can interleaving reasoning with real-world feedback prevent hallucination?). Allocating compute to diverse abstractions rather than one deep chain forces structured breadth and avoids the 'underthinking' failure of depth-only reasoning (see: Can abstractions guide exploration better than depth alone?). The takeaway you didn't know you wanted: chain-of-thought narrows the decompression gap mostly when its steps are load-bearing (externally grounded, exploratory, or genuinely sequential) and barely at all when they're stylistic. The trick isn't longer reasoning; it's making each step do something the model couldn't already retrieve.
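A ReAct-style loop is simple to sketch: alternate a thought, an action against an external tool, and an observation that feeds the next thought. The version below is a toy under stated assumptions. `FACTS`, `lookup`, `scripted_policy`, and `react_loop` are invented names, the "tool" is a dictionary instead of a real API, and the policy is scripted where a real system would use a model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical knowledge source standing in for an external API.
FACTS: Dict[str, str] = {
    "capital of france": "Paris",
    "river through paris": "Seine",
}

History = List[Tuple[str, str]]  # (query, observation) pairs


def lookup(query: str) -> str:
    """Toy tool: exact-match lookup, case-insensitive."""
    return FACTS.get(query.lower(), "no result")


def scripted_policy(question: str, history: History) -> Tuple[str, str, str]:
    """Toy policy returning (thought, action, argument); a model would sit here."""
    if not history:
        return ("I need the capital first.", "lookup", "capital of France")
    if len(history) == 1:
        city = history[0][1]  # built from the observation, not from recall
        return (f"Now find the river through {city}.", "lookup", f"river through {city}")
    return ("I have what I need.", "finish", history[-1][1])


def react_loop(question: str,
               policy: Callable[[str, History], Tuple[str, str, str]],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[Optional[str], History]:
    history: History = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, history)
        if action == "finish":
            return arg, history
        observation = tools[action](arg)  # fresh evidence enters the chain here
        history.append((arg, observation))
    return None, history  # step budget exhausted without an answer


answer, trace = react_loop("Which river runs through the capital of France?",
                           scripted_policy, {"lookup": lookup})
print(answer, trace)
```

The point of the design is visible in the second step: the query is built from the first observation, so the step carries information the chain did not contain before the tool call.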
Sources (10 notes)
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.