Can minimal reasoning chains match full explanations?
Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
Chain of Draft (CoD) is a prompting strategy with a simple constraint: each intermediate reasoning step must be minimal, containing only the essential mathematical operation or logical transformation, with no explanation of what was done or why. The contrast with standard chain-of-thought (CoT) is stark. Where CoT might produce six sentences to solve "20 - 12 = ?", CoD produces "20 - x = 12; x = 8."
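The contrast can be made concrete with a minimal sketch. The instruction wordings, the phrasing of the question, and the two sample outputs below are illustrative assumptions written for this note, not text quoted from the CoD paper, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
# Illustrative instruction wordings (assumptions, not quoted from any paper).
COT_INSTRUCTION = "Think step by step and explain each step before answering."
COD_INSTRUCTION = (
    "Think step by step, but keep only a minimal draft for each step: "
    "the essential operation, with no explanation."
)


def build_prompt(instruction: str, question: str) -> str:
    """Combine a reasoning-style instruction with the question."""
    return f"{instruction}\n\nQ: {question}\nA:"


def rough_token_count(text: str) -> int:
    """Whitespace word count, a crude proxy for a tokenizer's count."""
    return len(text.split())


# Hand-written sample outputs in each style for the same problem.
cot_output = (
    "We start with 20 items. After some are removed, 12 remain. "
    "To find how many were removed, we subtract the remaining amount "
    "from the starting amount. That gives 20 - 12. "
    "Computing the difference, we get 8. So the answer is 8."
)
cod_output = "20 - x = 12; x = 8."

if __name__ == "__main__":
    # Hypothetical phrasing of the "20 - 12 = ?" problem.
    question = "20 items, some removed, 12 left. How many were removed?"
    print(build_prompt(COD_INSTRUCTION, question))
    print("CoT words:", rough_token_count(cot_output))
    print("CoD words:", rough_token_count(cod_output))
```

Only the instruction line differs between the two prompts; the question and answer slot are identical, which is what makes the token comparison meaningful.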
The result: CoD matches or surpasses CoT accuracy across arithmetic reasoning, symbolic tasks, and commonsense tasks while using as little as 7.6% of CoT's tokens. The verbosity that CoT was assumed to require turns out to be unnecessary for the reasoning itself.
This challenges the implicit model underlying much test-time scaling work: that spending more tokens on reasoning generally produces better reasoning. The CoD finding suggests that verbosity in CoT is a training artifact. LLMs are trained on human-written explanatory text, and CoT prompting induces that explanatory style even when the task requires only the critical operations. When the prompt explicitly asks for minimal drafts, accuracy is preserved because the essential computation was never in the verbal explanation.
The parallel with human note-taking is telling: when people work through a math problem, they jot down intermediate equations, not narrations of their own reasoning process. Standard CoT asks LLMs to narrate their scratch work rather than write it.
This interacts with the finding in "Do reasoning traces actually cause correct answers?": if accuracy is preserved with 7.6% of the tokens, the other 92.4% was serving functions other than reasoning, such as explanatory style, human-readable documentation, or training-induced verbosity. The critical computation is localized in the minimal draft.
The practical implication for inference system design: token budget optimization should target verbose intermediate steps, not just final answer length. For tasks where CoD applies, the same budget supports roughly 13x more parallel chains (1 / 0.076 ≈ 13.2), which combines the CoD efficiency advantage with the result in "Why does parallel reasoning outperform single chain thinking?".
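The budget arithmetic can be checked with a short sketch. The function name, the example budget, and the per-chain CoT length are hypothetical values chosen for illustration; only the 7.6% ratio comes from the note.

```python
from typing import Tuple


def chains_within_budget(
    total_budget: int,
    cot_tokens_per_chain: int,
    cod_ratio: float = 0.076,  # CoD token count as a fraction of CoT's
) -> Tuple[int, int]:
    """Return how many (CoT, CoD) chains fit in the same token budget."""
    cot_chains = total_budget // cot_tokens_per_chain
    cod_tokens_per_chain = cot_tokens_per_chain * cod_ratio
    cod_chains = int(total_budget // cod_tokens_per_chain)
    return cot_chains, cod_chains


if __name__ == "__main__":
    # Hypothetical numbers: a 1000-token budget and 200-token CoT chains.
    cot_chains, cod_chains = chains_within_budget(1000, 200)
    print(cot_chains, cod_chains)   # 5 CoT chains vs 65 CoD chains
    print(round(1 / 0.076, 1))      # multiplier implied by the 7.6% ratio: 13.2
```

The multiplier holds only if each CoD chain really lands near 7.6% of the CoT length for the task at hand; a task with a higher ratio yields proportionally fewer extra chains.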
Activation steering provides a mechanistic explanation for why CoD works. "Can we steer reasoning toward brevity without retraining?" shows that verbose and concise reasoning modes are geometrically separated in the residual stream. ASC (Activation-Steered Compression) extracts a steering vector from 50 paired examples and achieves 67% length reduction without retraining. This suggests that CoD's prompting instruction ("keep each draft minimal") is a noisy way of pushing the model into the same activation region that the steering vector targets directly. The two methods are orthogonal and potentially combinable: CoD selects the concise region approximately through prompting, while ASC navigates to it precisely through activation intervention.
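As a rough illustration of what "extracting a steering vector from paired examples" can look like, the sketch below uses a difference of means between concise and verbose activations. This is a common construction for steering vectors and is an assumption here; the note does not say how ASC computes its vector. The function names, the toy 2-dimensional activations, and the `alpha` scale are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(
    verbose_acts: Sequence[Vector], concise_acts: Sequence[Vector]
) -> Vector:
    """Difference of means: points from the verbose region to the concise one.

    Assumption: this is one common recipe, not necessarily ASC's.
    """
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction by scale alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy activations standing in for residual-stream states of paired examples.
    verbose = [[1.0, 0.0], [3.0, 0.0]]
    concise = [[1.0, 2.0], [3.0, 4.0]]
    direction = steering_vector(verbose, concise)
    print(direction)
    print(steer([1.0, 1.0], direction, alpha=0.5))
```

In a real model the vectors would be residual-stream activations captured at a chosen layer, and the shift would be applied during the forward pass; both details are outside what this sketch covers.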
Inquiring lines that use this note as a source 113
This note is a source for these synthesized inquiries.
- Does chain-of-thought text causally drive reasoning or merely reflect it?
- What alignment artifacts suppress critical knowledge in LLM-generated explanations?
- Does sentence-level granularity capture enough structure for complex reasoning tasks?
- Does irrelevant content degrade reasoning even when it fits the context window?
- How do verbose and concise reasoning occupy different regions in activation space?
- What makes diffusion chain-of-thought reasoning qualitatively different from sequential chain-of-thought?
- Does changing decoding procedure reveal hidden chain-of-thought paths?
- Why do chain-of-thought prompts work if reasoning is not systematic?
- Why do language models produce verbose reasoning when asked to think step by step?
- How much accuracy is preserved when removing explanatory layers from reasoning traces?
- Can marginal hints integrate better into reasoning than comprehensive explanations?
- How much does annotator style actually influence chain-of-thought prompting performance?
- Why do correct reasoning traces in language models tend to be shorter?
- How often do papers treat chain-of-thought as interpretability incorrectly?
- Does each reasoning step in chain-of-thought introduce cumulative error?
- Can chain-of-thought faithfulness exist without causal necessity in reasoning?
- Can chain of thought traces be designed to prevent anthropomorphic misinterpretation?
- What makes a reasoning trace causally sufficient versus merely stylistically plausible?
- Can chain-of-thought explanations be both sufficient and necessary for model decisions?
- Can reasoning chains work without logical validity?
- Why do more capable models prefer shorter chains of thought?
- Can concise reasoning traces match verbose explanation accuracy?
- Why might expressed satisfaction with explanations diverge from actual cognitive clarity?
- Can testing prior knowledge and checking understanding improve explanation outcomes?
- What makes Compound-QA expose weaknesses in monologue reasoning?
- How do gradient descent iterations at inference compare to chain-of-thought reasoning chains?
- Why does explicit reasoning degrade passage reranking performance?
- Can chain of thought be deployed selectively to save inference tokens?
- Why do non-reasoning models work better under extreme decomposition than reasoning models?
- What happens to AI reasoning when you remove specific political features?
- How do chain-of-thought structures affect reasoning robustness?
- Why do temporal reasoning patterns matter more than final answers?
- How do covert thoughts differ from chain-of-thought reasoning in language models?
- How does chain-of-thought training change higher layer computations?
- Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?
- Why do longer reasoning chains signal hesitation rather than depth?
- Can chain of thought reasoning actually validate logical arguments?
- Does chain-of-thought reasoning improve mental state tracking in dialogue?
- Can chain-of-thought reasoning be genuinely causal if exemplars don't need logic?
- Why does distillation transfer reasoning patterns with few examples?
- What structural properties define effective long chain-of-thought reasoning?
- Do shorter reasoning traces actually produce more reliable model outputs?
- Can synthesized explanations be more auditable than winning-chain explanations?
- Do chain-of-thought explanations reveal genuine reasoning or trigger latent features?
- How much of a model's reasoning tokens are unnecessary for reaching the final answer?
- Why does intermediate step quality predict reasoning outcomes better than global features?
- Can latent reasoning mechanisms and recursive tracking mechanisms be combined effectively?
- How does separating decomposition from execution improve multi-step reasoning?
- Can models compress reasoning chains without external teacher supervision?
- Can recursive subtask trees implement tree-of-thought reasoning more efficiently?
- How does chain-of-thought pressure models to rationalize pattern exceptions?
- Why does chain-of-thought prompting fail to fix length-induced reasoning degradation?
- How does post-training on traces improve performance without semantic reasoning?
- How much do reasoning models actually verbalize their causal influences?
- Why do we measure reasoning quality by reading visible chains?
- Can latent space represent reasoning dimensions that text cannot?
- Why do verbalized reasoning chains fail on certain problem classes?
- What makes constraint satisfaction problems epistemically cleaner than other reasoning tasks?
- Why do chain-of-thought outputs look logical but perform rhetorically?
- Why does SFT reduce reasoning quality even when improving domain accuracy?
- Does RL pruning of documents differ fundamentally from rationale-driven evidence selection?
- Can chain-of-thought traces be faithful without causal sufficiency and necessity?
- Do chain-of-thought prompts help RLVR models predict annotation disagreement?
- How does chain-of-thought length affect attention to constraint tokens?
- Why do longer reasoning chains correlate with lower accuracy in o1-like models?
- Why does cross-text analogical reasoning fail when semantics decouple from symbols?
- How does chain-of-thought reasoning become decorative after domain-specific fine-tuning?
- Can continuous latent reasoning match discrete chain-of-thought without training modifications?
- Can reasoning in free text then formatting separately recover performance?
- What makes some sentences in reasoning traces have disproportionate causal influence?
- What sparse mechanistic structures drive reasoning traces in language models?
- Can latent reasoning achieve the same substitution without tokens?
- When is detailed step-by-step reasoning actually counterproductive for solving a problem?
- How much does chain-of-thought reasoning narrow the decompression gap?
- Does chain-of-thought reasoning help or hurt social reasoning tasks?
- Why do models rarely admit to their actual reasoning in chain-of-thought traces?
- Do shorter reasoning chains maintain instruction adherence better than longer ones?
- What makes thought identifiability provable without auxiliary training data?
- Why do language models produce unfaithful chain of thought explanations?
- Does chain of thought reasoning faithfully reflect what a model actually believes?
- Can minimal reasoning steps match verbose reasoning accuracy?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- Why do concise reasoning chains match verbose chain-of-thought token efficiency?
- Why do expert reasoners skip steps that novices must state explicitly?
- Why does chain-of-thought fail to improve multimodal model perception performance?
- How does program-aided reasoning externalize intermediate computation into executable form?
- Can argumentation structure improve reasoning through decomposition alone?
- Do longer chain-of-thought traces improve interpretability or just performance?
- Are chain-of-thought traces anthropomorphizing how AI models really reason?
- Can chain-of-thought traces harm rather than help user understanding?
- How much of a reasoning trace is actually redundant or unnecessary?
- How does explicit reasoning transparency differ from internal chain-of-thought explanations?
- How does faithfulness differ from informativeness in chain-of-thought evaluation?
- Can bounded workspaces prevent overthinking better than summarization alone?
- What makes answer equivalence sufficient to discard a reasoning path?
- Does chain-of-thought accuracy degrade with longer reasoning traces?
- Can reasoning happen in latent space without chain of thought?
- How do completeness scaffolds force explicit step-by-step derivation?
- How much do compressed reasoning traces transfer across different models?
- How can benchmark accuracy scores mask the absence of interpretable reasoning structure?
- Why does unstructured chain-of-thought permit assumption-based errors that templates prevent?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- Why do language model reasoning chains look fluent when they deviate from the task?
- What makes a reasoning explanation faithful rather than just plausible?
- Do computational systems need formal argument analysis for explainability?
- What makes some bottlenecks invisible to chain-of-thought training?
- How does latent reasoning recursion compare to chain-of-thought reasoning?
- How does supervised fine-tuning degrade chain-of-thought faithfulness over time?
- Can we detect redundant reasoning steps during model inference instead of training?
- How much of chain-of-thought reasoning actually diverges from the final answer?
- Can minimal training signals unlock reasoning already latent in pretrained representations?
- Can single representation edits match chain-of-thought reasoning without explicit steps?
- What latent reasoning capability do base models already possess before training?
Related concepts in this collection 9
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  CoD isolates what trace content is computationally necessary; the 92.4% of tokens removed are the stylistic layer
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  CoD multiplies the benefit: same budget, more parallel chains, each chain minimal
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  CoD inverts the overthinking frame: instead of adding tokens until degradation, start minimal and add only when accuracy demands it
- Does extended thinking actually improve reasoning or just increase variance?
  When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
  verbose CoT extends into the variance-inflating range; minimal CoD stays in the efficient range
- Can we steer reasoning toward brevity without retraining?
  This explores whether model reasoning style occupies learnable geometric directions in activation space, and whether we can shift toward concise thinking by steering through that space without expensive retraining.
  mechanistic explanation: CoD prompting pushes toward the same activation region that ASC steering vectors target directly; orthogonal and combinable
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  CoD amplifies adaptive allocation: when each chain uses 7.6% of standard CoT tokens, the same compute budget supports 13x more parallel chains or can be redistributed to harder prompts that genuinely need more reasoning depth
- Why does chain of thought accuracy eventually decline with length?
  Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.
  CoD operationalizes the inverted-U finding: capable models prefer shorter chains because the reasoning signal is concentrated in minimal critical operations, not distributed across verbose explanation; CoD's 7.6% token count matches the prediction that the optimal length for capable models is far shorter than standard CoT
- Do reasoning models switch between ideas too frequently?
  Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
  CoD addresses underthinking from the format side: minimal per-step drafts enforce depth within each step by eliminating the verbal runway for thought-switching; where TIP penalizes switching tokens at decoding time, CoD prevents the verbose intermediate context that enables switching in the first place
- Does gradually tightening token budgets beat fixed budget training?
  Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
  CoD validates the compression phase: curriculum training discovers strategies with generous budgets then compresses, and CoD demonstrates that the compressed endpoint (7.6% of tokens) retains full accuracy, confirming that the generous-to-tight curriculum removes filler rather than substance
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Break the Chain: Large Language Models Can be Shortcut Reasoners
- Answering Questions by Meta-Reasoning over Multiple Chains of Thought
- Chain of Draft: Thinking Faster by Writing Less
- Hierarchical Reasoning Model
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
- Self-Evaluation Guided Beam Search for Reasoning
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Original note title
concise intermediate reasoning chains match verbose cot accuracy with 7.6 percent of the tokens