Can steering a single latent feature replicate chain-of-thought performance?
This explores whether nudging one internal 'reasoning' feature inside a model can stand in for writing out an explicit chain-of-thought — and what that equivalence implies about where reasoning actually lives.
The corpus's short answer is yes, at least sometimes, and the fact that it works matters more than any efficiency gain. Researchers using sparse autoencoders (SAEs) identified a single 'reasoning' feature that, when directly amplified, matches or beats chain-of-thought performance across six model families (Can we trigger reasoning without explicit chain-of-thought prompts?). Notably, this reasoning mode switches on early in generation and overrides surface-level instructions, which suggests the capability is not something the prompt creates so much as something the prompt happens to trigger.
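To make "directly amplified" concrete, here is a minimal sketch of feature steering on a single hidden state. It assumes the feature is represented by one direction vector (for example, a row of an SAE decoder) and that steering means adding a scaled copy of that direction to the activation. The names `steer`, `feature_strength`, `reasoning_direction`, and `alpha` are hypothetical, the numbers are made up, and the source does not specify how the researchers applied the intervention.

```python
import math


def unit(direction):
    """Return `direction` scaled to length 1."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("feature direction must be non-zero")
    return [x / norm for x in direction]


def feature_strength(hidden, direction):
    """Projection of a hidden state onto the unit feature direction.

    Assumption: this dot product is only a proxy for how active the feature is.
    A real SAE would compute the activation with its encoder weights and bias.
    """
    return sum(h * u for h, u in zip(hidden, unit(direction)))


def steer(hidden, direction, alpha):
    """Add `alpha` units of the feature direction to one hidden state."""
    return [h + alpha * u for h, u in zip(hidden, unit(direction))]


# Toy 3-dimensional example with made-up numbers.
hidden = [1.0, 0.0, 2.0]
reasoning_direction = [0.0, 3.0, 4.0]  # hypothetical SAE decoder row

steered = steer(hidden, reasoning_direction, alpha=5.0)
before = feature_strength(hidden, reasoning_direction)
after = feature_strength(steered, reasoning_direction)
print(before, after)
```

Because the direction is normalized, the projection onto it rises by `alpha` (up to floating-point rounding) while every component orthogonal to it is left untouched. In a real model the same addition would typically be applied to a chosen layer's activations during generation, not to a standalone list.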
That reframing is the real payload. If one internal knob reproduces what a paragraph of step-by-step text does, then the text was never the source of the reasoning; it was a lever. The corpus backs this up from several angles: base models already contain latent reasoning ability that minimal intervention unlocks, and five independent methods (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, and RLVR) all elicit reasoning that is already present in base-model activations (Do base models already contain hidden reasoning ability?). The bottleneck is elicitation, not acquisition. Chain-of-thought, on this view, is one of many ways to flip a switch that's already wired.
This connects to a quieter finding: reasoning behavior often turns out to be a *direction* in activation space rather than a property of the words. One vector extracted from just 50 paired examples can cut chain-of-thought length by about two-thirds (67%) while holding accuracy steady (Can we steer reasoning toward brevity without retraining?). So both whether the model reasons and how verbosely it reasons can be steered geometrically, without retraining. The verbose text we read may be a side effect of the internal state, not its cause.
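One common way to obtain such a direction from paired examples is a difference of means: average the activations recorded on concise examples, average those recorded on their verbose counterparts, and subtract. The sketch below assumes that recipe. The source does not state that this is how the vector is actually extracted, and the function names, the `scale` parameter, and the two-dimensional toy data are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def extract_steering_vector(concise_acts, verbose_acts):
    """Difference of means over paired activations (concise minus verbose).

    Assumption: each pair holds activations from the same layer and position
    for a concise and a verbose solution to the same problem.
    """
    if not concise_acts or len(concise_acts) != len(verbose_acts):
        raise ValueError("need the same non-zero number of examples on both sides")
    concise_mean = mean_vector(concise_acts)
    verbose_mean = mean_vector(verbose_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def apply_steering(hidden, vector, scale=1.0):
    """Shift one hidden state along the steering vector."""
    return [h + scale * s for h, s in zip(hidden, vector)]


# Two made-up pairs stand in for the 50 pairs mentioned in the text.
concise = [[1.0, 2.0], [3.0, 4.0]]
verbose = [[0.0, 1.0], [1.0, 1.0]]

brevity_vector = extract_steering_vector(concise, verbose)
shifted = apply_steering([0.0, 0.0], brevity_vector, scale=2.0)
print(brevity_vector, shifted)
```

The appeal of this kind of construction is that it needs no gradient updates: the vector is computed once from recorded activations and then added at inference time, which is consistent with the "without retraining" claim above.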
There's a sharp tension worth sitting with. A large strand of the corpus argues that chain-of-thought is constrained imitation of reasoning *form*, pattern-matching familiar schemata rather than performing genuine inference, which is why it degrades predictably outside its training distribution and why structurally valid-looking but logically broken prompts still succeed (Does chain-of-thought reasoning reveal genuine inference or pattern matching?; Does chain-of-thought reasoning actually generalize beyond training data?; What makes chain-of-thought reasoning actually work?). If CoT is partly theater, and a single feature replicates it, that raises an uncomfortable question: is feature-steering unlocking real latent computation, or just reproducing the same imitation more cheaply? The corpus doesn't fully resolve this, but it does suggest the honest framing is 'we found the lever,' not 'we found the reasoning.'
The thing you might not have known you wanted to know: this whole line of work is pushing reasoning *off the page* entirely. Beyond steering existing features, researchers are building latent-thought vectors as a scaling dimension separate from parameters (Can latent thought vectors scale language models beyond parameters?) and sampling parallel latent trajectories to scale reasoning in width rather than depth (Can reasoning systems scale wider instead of only deeper?). The visible chain-of-thought may end up being a transitional artifact: a human-readable shadow of computation that increasingly happens in the model's internal space, where a single steered feature is just the most direct way in.
Sources (8 notes)
SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.