Can single representation edits match chain-of-thought reasoning without explicit steps?
This explores whether you can trigger reasoning by editing a single internal feature of a model — one nudge to its representations — instead of making it write out step-by-step chain-of-thought, and whether that shortcut actually performs as well.
The corpus has a direct answer and a deeper reason that answer is plausible. The direct evidence: researchers found a sparse-autoencoder (SAE) reasoning feature that, when steered, matches or beats chain-of-thought (CoT) performance across six model families. The feature activates early in generation and overrides surface-level instructions, which suggests that 'reasoning mode' is a latent capability that explicit prompting merely switches on rather than constructs (Can we trigger reasoning without explicit chain-of-thought prompts?). So the short answer is yes: at least one single-feature edit reproduces the effect of spelling out the steps.
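To make "steering a feature" concrete, here is a toy sketch. It assumes that steering means adding a scaled copy of the feature's direction to a hidden-state vector, which is one common way such edits are done; the notes do not say which mechanism the researchers used. The vectors, the `steer` helper, and the `alpha` strength parameter are all hypothetical, and a real intervention would operate on a model's activations rather than on a four-number list.

```python
import math


def unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, alpha):
    """Add alpha units of the (normalized) feature direction to a hidden state."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]


# Toy hidden state and a hypothetical "reasoning" feature direction (assumed values).
hidden = [0.5, -1.0, 0.25, 2.0]
reasoning_direction = [1.0, 0.0, 1.0, 0.0]

steered = steer(hidden, reasoning_direction, alpha=3.0)

# The projection onto the feature direction rises by exactly alpha;
# components orthogonal to the direction are left untouched.
before = dot(hidden, unit(reasoning_direction))
after = dot(steered, unit(reasoning_direction))
print(f"projection before: {before:.3f}, after: {after:.3f}")
```

The point of the sketch is only that the edit is a single vector addition: nothing is written out, and no tokens are generated to carry the "reasoning".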
Why should that even be possible? Because a growing line of work argues the written steps were never doing as much computational work as they appear to. Several notes converge on the claim that CoT reproduces the *form* of reasoning through learned patterns rather than performing genuine symbolic inference (What makes chain-of-thought reasoning actually work?; Why does chain-of-thought reasoning fail in predictable ways?). The tells are striking: invalid CoT prompts work about as well as valid ones, training format shapes reasoning strategy 7.5× more than the domain does, and shifting where a demo sits can swing accuracy by 20% (What makes chain-of-thought reasoning actually work?). If the visible chain is mostly stylistic scaffolding that conditions the model into a familiar mode, then bypassing the text and editing the underlying state directly is not a paradox; it is the efficient version of the same maneuver.
The 'steps are mostly surface' thesis shows up from the compression angle too. Chain of Draft matches verbose CoT accuracy using only 7.6% of the tokens, meaning roughly 92% of a normal chain served documentation and style, not computation (Can minimal reasoning chains match full explanations?). Dynamic intervention can prune 75% of reasoning steps while holding accuracy, because verification and backtracking steps turn out to receive almost no downstream attention (Can reasoning steps be dynamically pruned without losing accuracy?). A single representation edit is the limit case of this trend: if you can throw away three-quarters of the steps, maybe you can throw away all of them and keep the one latent switch that mattered.
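The pruning idea can be sketched in a few lines. This is an illustration under stated assumptions, not the method from the cited note: it assumes each reasoning step already comes with a single downstream-attention score, and it simply keeps the top-scoring fraction in their original order. The step texts, the scores, and the `prune_steps` helper are invented for the example.

```python
def prune_steps(steps, keep_fraction):
    """Keep the highest-attention steps, preserving their original order.

    steps: list of (text, attention_score) pairs.
    keep_fraction: share of steps to keep, in (0, 1].
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(steps) * keep_fraction))
    # Rank step indices by attention score, highest first.
    ranked = sorted(range(len(steps)), key=lambda i: steps[i][1], reverse=True)
    # Restore chain order among the survivors.
    kept_indices = sorted(ranked[:n_keep])
    return [steps[i][0] for i in kept_indices]


# Invented scores: verification and backtracking steps get little attention.
steps = [
    ("restate the problem", 0.05),
    ("set up the equation", 0.40),
    ("verify the setup", 0.02),
    ("solve for x", 0.35),
    ("backtrack and recheck", 0.03),
    ("double-check arithmetic", 0.01),
    ("re-read the question", 0.04),
    ("state the answer", 0.10),
]

# Keeping 25% of the steps corresponds to pruning 75% of them.
kept = prune_steps(steps, keep_fraction=0.25)
print(kept)
```

With these made-up scores, the two steps that survive are the ones that set up and solve the problem, which is the pattern the pruning result describes.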
The honest caveats are also in the corpus. The same imitation lens predicts that this kind of reasoning degrades systematically outside its training distribution, becoming fluent but logically inconsistent under shifts in task, length, or format (Does chain-of-thought reasoning actually generalize beyond training data?; Does chain-of-thought reasoning reveal genuine inference or pattern matching?). So 'matches CoT' inherits CoT's ceiling rather than transcending it. There's also a faithfulness tension worth knowing: written chains already influence final answers less than they appear to, and fine-tuning weakens that causal link further, making reasoning performative rather than functional (Does fine-tuning disconnect reasoning steps from final answers?). A latent steer that skips the visible trace entirely buys efficiency but pays in interpretability: you lose the (already partly illusory) window into *why* the model answered as it did. The thing you didn't know you wanted to know: the reason single-feature steering can replace step-by-step reasoning is the same reason step-by-step reasoning was never quite the load-bearing computation we assumed.
Sources 9 notes
SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.