Can models compress reasoning chains without external teacher supervision?
This note explores whether a model can shorten its own reasoning chains using internal signals or lightweight prompting, rather than the usual route of distilling brevity from a larger 'teacher' model.
This reads the question as: can compression come from inside the model (or from cheap, training-free nudges) instead of from an external teacher feeding shorter examples? The corpus answers a fairly emphatic yes — and the reason it works turns out to be more interesting than the compression itself.
The most direct evidence is that most of a reasoning chain isn't doing computational work. Chain of Draft matches full chain-of-thought accuracy on arithmetic, symbolic, and commonsense tasks while using just 7.6% of the tokens, meaning the other 92.4% served style and documentation, not thinking (see "Can minimal reasoning chains match full explanations?"). If verbose explanation is mostly padding, then compression doesn't require teaching the model anything new; it requires stripping what was never load-bearing. A prompt alone gets you there.
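To make the prompt-only route concrete, here is a minimal sketch of a draft-style instruction next to a standard step-by-step one, plus a crude way to measure how much shorter the resulting chain is. The instruction wording, the `####` answer separator, the helper names, and the two example chains are all illustrative assumptions, not text or data taken from the Chain of Draft work, and whitespace splitting is only a rough stand-in for a real tokenizer.

```python
# Illustrative instruction strings. The wording is an assumption for this
# sketch, not a quotation of any published prompt.
COT_INSTRUCTION = "Think step by step, then give the final answer after '####'."
DRAFT_INSTRUCTION = (
    "Think step by step, but keep only a minimal draft of each step, "
    "a few words at most. Give the final answer after '####'."
)


def build_prompt(instruction: str, question: str) -> str:
    """Prepend the chosen reasoning instruction to a question."""
    return f"{instruction}\n\nQ: {question}\nA:"


def extract_answer(chain: str) -> str:
    """Return whatever follows the last '####' separator."""
    return chain.split("####")[-1].strip()


def token_fraction(draft: str, full: str) -> float:
    """Whitespace-token ratio of a draft chain to a full chain.

    A real measurement would use the model's own tokenizer.
    """
    return len(draft.split()) / len(full.split())


# Hand-written example chains (hypothetical model outputs).
full_chain = (
    "Jason started with 20 lollipops. He gave some to Denny. "
    "Now he has 12 left. To find how many he gave away we subtract 12 from 20, "
    "which gives 8. #### 8"
)
draft_chain = "20 - x = 12; x = 8. #### 8"

if __name__ == "__main__":
    print(build_prompt(DRAFT_INSTRUCTION, "How many lollipops did Jason give away?"))
    print("same answer:", extract_answer(draft_chain) == extract_answer(full_chain))
    print("draft/full token fraction:", round(token_fraction(draft_chain, full_chain), 3))
```

The point of the comparison is that both chains reach the same answer after the separator; only the amount of surrounding explanation differs.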
You can also do it from the inside without retraining. Activation-Steered Compression finds that verbose and concise reasoning live in distinct, linearly separable regions of activation space, so a single steering vector, extracted from just 50 paired examples, cuts chain length by 67% while holding accuracy and delivering a 2.73x speedup (see "Can we steer reasoning toward brevity without retraining?"). That's compression as a direction in the model's own representations, with no teacher distillation involved. Diffusion LLMs reach a similar place by a different road: because answer confidence converges early while reasoning is still refining, an early-exit mechanism can halve compute using only the model's own confidence signal (see "Can reasoning and answers be generated separately in language models?").
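The steering idea can be sketched in a few lines. The version below assumes the vector is a difference of mean activations between paired concise and verbose examples, which is a common way to build steering vectors but is an assumption here, not a detail confirmed by the source. The tiny 3-dimensional activations, the function names, and the `alpha` strength are hypothetical; in a real model the shift would be applied to a chosen layer's hidden state during generation.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a batch of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(
    verbose_acts: Sequence[Sequence[float]],
    concise_acts: Sequence[Sequence[float]],
) -> Vector:
    """Difference of means: points from the verbose region toward the concise one."""
    mu_verbose = mean_vector(verbose_acts)
    mu_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mu_concise, mu_verbose)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state along the steering direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(x: Sequence[float], direction: Sequence[float]) -> float:
    """Coordinate of x along the direction, in units of the direction's length squared."""
    norm_sq = sum(d * d for d in direction)
    return sum(a * d for a, d in zip(x, direction)) / norm_sq


# Toy paired activations (made-up values; a real setup would record these
# from the model on verbose and concise versions of the same examples).
verbose_acts = [[1.0, 0.0, 0.5], [1.2, 0.2, 0.5]]
concise_acts = [[-1.0, 0.0, 0.5], [-0.8, 0.2, 0.5]]

direction = steering_vector(verbose_acts, concise_acts)
hidden = [1.0, 0.1, 0.5]  # a state sitting in the "verbose" region
steered = steer(hidden, direction, alpha=0.5)

if __name__ == "__main__":
    print("steering direction:", direction)
    print("projection before:", projection(hidden, direction))
    print("projection after: ", projection(steered, direction))
```

Because the two regions are described as linearly separable, a single direction is enough in this picture: adding `alpha` times the vector moves the state's projection toward the concise side by exactly `alpha`, without touching the model's weights.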
The deepest hint at why this is possible: logit-lens analysis of transformers trained with hidden chain-of-thought tokens suggests they compute the answer in their first few layers (layers 1-3), then actively suppress those representations in the final layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). If the reasoning is already finished before the visible chain is written, the chain is partly a performance of reasoning, which lines up with the finding that chain-of-thought mostly reproduces familiar reasoning *forms* from training rather than generating fresh inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Compression is easy precisely because much of the output is theater.
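A logit lens reads intermediate layers by projecting each layer's hidden state through the output (unembedding) matrix and ranking the vocabulary. The toy sketch below shows the shape of that analysis under heavy simplification: the two-dimensional hidden states, the three-token vocabulary, and every number are made up for illustration, and the final normalization step that real implementations usually apply before unembedding is omitted.

```python
from typing import Dict, List, Sequence


def logits(hidden: Sequence[float], unembed: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Dot each token's unembedding row with the hidden state."""
    return {tok: sum(h * w for h, w in zip(hidden, row)) for tok, row in unembed.items()}


def ranked_tokens(hidden: Sequence[float], unembed: Dict[str, Sequence[float]]) -> List[str]:
    """Tokens sorted from highest to lowest logit for one hidden state."""
    scores = logits(hidden, unembed)
    return sorted(scores, key=lambda tok: scores[tok], reverse=True)


def logit_lens(
    layer_states: Sequence[Sequence[float]],
    unembed: Dict[str, Sequence[float]],
) -> List[List[str]]:
    """Apply the same unembedding to every layer's hidden state."""
    return [ranked_tokens(h, unembed) for h in layer_states]


# Toy vocabulary: an answer token, a filler token, and a distractor.
unembed = {
    "8": [1.0, 0.0],
    "...": [0.0, 1.0],
    "cat": [-1.0, -1.0],
}

# Made-up hidden states at one position, from an early layer to the final one.
# They are constructed so the answer dominates early and filler dominates late.
layer_states = [
    [2.0, 0.1],
    [1.5, 1.0],
    [1.0, 3.0],
]

ranks = logit_lens(layer_states, unembed)

if __name__ == "__main__":
    for layer, order in enumerate(ranks, start=1):
        print(f"layer {layer}: top={order[0]!r}, rank of '8' = {order.index('8') + 1}")
```

In this constructed example the answer token is the top prediction in the early layers and drops to second place in the last layer, behind the filler token. That mirrors the pattern described above, where the answer stays recoverable from lower-ranked predictions even when the emitted token is filler.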
The caveat worth carrying away: not all tokens are filler. Reasoning models beat non-reasoning models at any inference budget because training instilled a protocol that makes their tokens *productive*; the gap is about training structure, not token count (see "Can non-reasoning models catch up with more compute?"). So self-compression works by cutting documentation overhead, not the computational substrate. The unexpected lesson: you don't need a teacher to make a model concise, because the model was already padding, but you do need to know which tokens were working and which were just talking.
Sources (6 notes)
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.