Can concise reasoning traces match verbose explanation accuracy?
This explores whether stripped-down reasoning traces (fewer words, less explanation) can hit the same accuracy as long, verbose chains-of-thought — and the corpus has a surprisingly strong answer: yes, because most of the words were never doing the computing.
The most direct evidence is Chain of Draft, which matches standard chain-of-thought accuracy on arithmetic, symbolic, and commonsense tasks while spending just 7.6% of the tokens (Can minimal reasoning chains match full explanations?). The striking reframe: the 92.4% of tokens that can be cut were serving style and documentation, not the reasoning itself. So the question isn't really 'can shorter work' but 'what were the extra words ever for?'
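To make the token arithmetic concrete, here is a minimal sketch that turns the 7.6% figure into a budget estimate. The function name and the 1,000-token verbose trace are hypothetical, and the ratio is the aggregate figure quoted above; per-task ratios will differ.

```python
def draft_token_budget(cot_tokens: int, draft_ratio: float = 0.076) -> dict:
    """Estimate a concise-trace budget from a verbose trace length.

    draft_ratio defaults to the 7.6% figure quoted in the text; treat it
    as a rough aggregate, not a guarantee for any single task.
    """
    draft_tokens = round(cot_tokens * draft_ratio)
    return {
        "cot_tokens": cot_tokens,
        "draft_tokens": draft_tokens,
        "tokens_saved": cot_tokens - draft_tokens,
        "fraction_saved": 1.0 - draft_ratio,
    }


# Hypothetical example: a 1,000-token verbose chain-of-thought.
budget = draft_token_budget(1000)
print(budget["draft_tokens"], budget["tokens_saved"])
```

For a 1,000-token verbose trace this gives a 76-token draft and 924 tokens saved.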
The corpus answers that by attacking the assumption that traces contain meaningful reasoning at all. Models trained on deliberately corrupted, irrelevant traces hold their accuracy, and sometimes generalize better out-of-distribution, which implies traces act as computational scaffolding rather than logical steps (Do reasoning traces need to be semantically correct?). The same theme runs through findings that invalid chain-of-thought prompts work nearly as well as valid ones, that training format shapes reasoning strategy far more than domain does, and that traces are persuasive appearances rather than faithful records of computation (What makes chain-of-thought reasoning actually work?; Do reasoning traces show how models actually think?). If semantic correctness isn't what's buying accuracy, then trimming prose for brevity shouldn't cost accuracy either. That's the deeper reason concise traces hold up.
But 'concise' isn't the same as 'short everywhere,' and here the corpus adds nuance worth knowing. Accuracy versus length follows an inverted-U: it peaks at an intermediate length, and the optimal length rises with task difficulty but falls as models get more capable (Why does chain of thought accuracy eventually decline with length?). Notably, reinforcement learning pushes models toward shorter chains as they improve, so brevity emerges from the reward signal rather than from an explicit length constraint. The right target isn't minimum tokens; it's the sweet spot for a given problem and a given model.
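One way to look for that sweet spot is to sweep a few trace-length budgets and keep the one with the best measured accuracy. The sketch below assumes you already have such a sweep. The helper name is hypothetical and the accuracy numbers are invented placeholders chosen only to show the inverted-U shape; they are not results from the corpus.

```python
from typing import Mapping


def best_length(sweep: Mapping[int, float]) -> int:
    """Return the token budget with the highest measured accuracy.

    Ties are broken in favour of the shorter budget, since extra
    tokens cost compute without buying accuracy.
    """
    return min(sweep, key=lambda n: (-sweep[n], n))


# Invented placeholder numbers, for illustration only.
easy_task = {64: 0.81, 128: 0.88, 256: 0.86, 512: 0.79}
hard_task = {64: 0.42, 128: 0.55, 256: 0.63, 512: 0.58}

print(best_length(easy_task))
print(best_length(hard_task))
```

With these placeholder sweeps the harder task peaks at a longer budget than the easier one, which is the pattern the inverted-U finding describes.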
There's also a catch that argues for conciseness from the opposite direction: length isn't free. On FLenQA, reasoning accuracy drops from 92% to 68% with just 3,000 tokens of input padding, well below the context window limit (Does reasoning ability actually degrade with longer inputs?). And longer traces often signal proximity to training data rather than harder thinking (Does longer reasoning actually mean harder problems?). Verbosity isn't a marker of more careful reasoning; sometimes it's just recall of familiar schemas, and extra tokens in context can be an active drag on performance.
The thing you might not have expected: if you must cut, not all sentences are equal. Planning and backtracking sentences act as 'thought anchors,' sparse pivots that disproportionately steer everything downstream (Which sentences actually steer a reasoning trace?). And step-level confidence can catch where a trace breaks down and stop generation early, reaching accuracy comparable to majority voting with far fewer traces (Does step-level confidence outperform global averaging for trace filtering?). So the frontier isn't 'verbose vs. concise.' It's keeping the few load-bearing moves and dropping the documentation around them.
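The contrast between local and global confidence can be sketched in a few lines. This is an illustrative toy, not the method from the cited note: it assumes per-step confidence scores are already available (for example, derived from token log-probabilities), and the function name, window size, threshold, and scores are all hypothetical.

```python
from typing import Optional, Sequence


def first_breakdown(
    step_conf: Sequence[float], window: int = 3, threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the step that closes the first sliding window
    whose mean confidence falls below threshold, or None if no window does."""
    for end in range(window, len(step_conf) + 1):
        local = sum(step_conf[end - window:end]) / window
        if local < threshold:
            return end - 1  # generation could stop here
    return None


# Hypothetical per-step confidences: a trace that collapses mid-way
# and then recovers its surface confidence.
trace = [0.9, 0.92, 0.88, 0.9, 0.3, 0.25, 0.2, 0.9, 0.95, 0.9]

global_mean = sum(trace) / len(trace)  # the whole-trace average hides the dip
stop_at = first_breakdown(trace)       # the local window exposes it

print(round(global_mean, 2), stop_at)
```

Here the whole-trace average stays above the threshold, so a global filter would keep the trace, while the windowed check flags the collapse at step index 5 and would let generation stop before the remaining steps are produced.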
Sources (9 notes)
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Research shows that training format shapes reasoning strategy 7.5× more than domain does, that demonstration position swings accuracy by 20%, and that invalid CoT prompts work about as well as valid ones. CoT is pattern-guided generation, not formal logic.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, indicating that brevity emerges from reward signals rather than from explicit length constraints.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.