INQUIRING LINE

Most of an AI's step-by-step reasoning is padding — and it turns out you don't need a bigger model to strip it.

Can models compress reasoning chains without external teacher supervision?

This explores whether a model can shorten its own reasoning chains using internal signals or lightweight prompting — rather than distilling brevity from a larger 'teacher' model the usual way.

This reads the question as: can compression come from inside the model (or from cheap, training-free nudges) instead of from an external teacher feeding shorter examples? The corpus answers a fairly emphatic yes — and the reason it works turns out to be more interesting than the compression itself.

The most direct evidence is that most of a reasoning chain isn't doing computational work. Chain of Draft matches full chain-of-thought accuracy on arithmetic, symbolic, and commonsense tasks while using just 7.6% of the tokens, meaning the other 92.4% served style and documentation, not thinking (see "Can minimal reasoning chains match full explanations?"). If verbose explanation is mostly padding, then compression doesn't require teaching the model anything new; it requires stripping what was never load-bearing. A prompt alone gets you there.
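As a rough illustration of prompt-only compression, the sketch below contrasts a standard chain-of-thought instruction with a draft-style one and measures how many tokens a terse chain keeps. The instruction wording is a paraphrase written for this example, not the paper's exact prompt; the two chains are hand-written rather than model outputs; and `build_prompt`, `token_count`, and `fraction_kept` are hypothetical helper names. The whitespace token count is a crude stand-in for a real tokenizer.

```python
# Hypothetical instruction wording: a paraphrase for illustration only.
COT_INSTRUCTION = (
    "Think step by step to answer the question. "
    "Give the final answer after '####'."
)
COD_INSTRUCTION = (
    "Think step by step, but keep only a minimal draft of each step, "
    "five words at most. Give the final answer after '####'."
)


def build_prompt(instruction: str, question: str) -> str:
    """Combine an instruction with the question in a simple Q/A layout."""
    return f"{instruction}\n\nQ: {question}\nA:"


def token_count(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would give other numbers."""
    return len(text.split())


def fraction_kept(verbose: str, draft: str) -> float:
    """Share of the verbose chain's tokens that the draft chain retains."""
    return token_count(draft) / token_count(verbose)


# Hand-written example chains (assumption: not generated by any model).
verbose_chain = (
    "Jason started with 20 lollipops. He gave some to Denny and now has 12 left. "
    "To find how many he gave away we subtract 12 from 20, which gives 8. #### 8"
)
draft_chain = "20 - 12 = 8 #### 8"

kept = fraction_kept(verbose_chain, draft_chain)
print(build_prompt(COD_INSTRUCTION, "How many lollipops did Jason give Denny?"))
print(f"kept {kept:.1%} of tokens, removed {1 - kept:.1%}")
```

The ratio from a single hand-written pair says nothing about accuracy; the 7.6% figure above is a benchmark-level result, and reproducing it would require running both prompts through a model and scoring the answers.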

You can also do it from the inside without retraining. Activation-Steered Compression finds that verbose and concise reasoning live in distinct, linearly separable regions of activation space, so a single steering vector, extracted from just 50 paired examples, cuts chain length by 67% while holding accuracy and delivering a 2.73x speedup (see "Can we steer reasoning toward brevity without retraining?"). That's compression as a direction in the model's own representations, with no teacher distillation involved. Diffusion LLMs reach a similar place by a different road: because answer confidence converges early while the reasoning is still being refined, an early-exit mechanism driven by the model's own confidence signal can halve compute (see "Can reasoning and answers be generated separately in language models?").
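The steering idea can be sketched with synthetic data. The example below assumes the vector is computed as a mean difference between paired concise and verbose activations, which is a common generic recipe for activation steering and not necessarily the exact procedure in the paper. The activations are random stand-ins with a planted direction, the 16-dimensional hidden size is arbitrary, and `steering_vector` and `steer` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_pairs = 16, 50  # 50 mirrors the pair count quoted above; 16 is arbitrary

# Synthetic stand-ins for activations at one layer (assumption: not real model data).
# Concise activations are the verbose ones shifted along a planted direction plus noise.
true_direction = rng.normal(size=hidden_dim)
true_direction /= np.linalg.norm(true_direction)
verbose_acts = rng.normal(size=(n_pairs, hidden_dim))
concise_acts = (
    verbose_acts
    + 3.0 * true_direction
    + 0.1 * rng.normal(size=(n_pairs, hidden_dim))
)


def steering_vector(concise: np.ndarray, verbose: np.ndarray) -> np.ndarray:
    """Mean activation difference over paired examples (concise minus verbose)."""
    return (concise - verbose).mean(axis=0)


def steer(hidden: np.ndarray, vector: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to a hidden state."""
    return hidden + strength * vector


vec = steering_vector(concise_acts, verbose_acts)
unit = vec / np.linalg.norm(vec)

new_verbose = rng.normal(size=hidden_dim)  # an unseen "verbose" activation
steered = steer(new_verbose, vec)

cosine = float(unit @ true_direction)
shift = float((steered - new_verbose) @ unit)
print(f"cosine with planted direction: {cosine:.3f}")
print(f"shift along steering direction: {shift:.3f}")
```

In a real model the addition would happen inside the forward pass at a chosen layer, and the layer and `strength` would need tuning; this toy only shows that a handful of pairs can recover a linear direction when one exists.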

The deepest hint at why this is possible comes from logit-lens analysis: models trained with hidden chain-of-thought tokens appear to compute the correct answer in their first few layers (layers 1-3), then actively suppress those representations in later layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). If the reasoning is already finished before the visible chain is written, the chain is partly a performance of reasoning, which lines up with the finding that chain-of-thought mostly reproduces familiar reasoning *forms* from training rather than generating fresh inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Compression is easy precisely because much of the output is theater.
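A toy version of the logit-lens readout makes the early-answer pattern concrete. Everything here is hand-built for illustration: the four-token vocabulary, the identity unembedding matrix, and the per-layer hidden states are assumptions, not activations from any model. A real logit lens would also typically apply the model's final normalization before unembedding, which this sketch omits.

```python
import numpy as np

vocab = ["7", "...", "the", "cat"]  # toy vocabulary; "..." plays the filler token
answer_id, filler_id = 0, 1
unembed = np.eye(len(vocab))        # hypothetical unembedding: one axis per token

# Hand-built hidden states for one position, one row per layer (assumption).
hidden_by_layer = np.array([
    [2.0, 0.1, 0.0, 0.0],  # layer 1: answer dominates
    [3.0, 0.5, 0.0, 0.0],  # layer 2: answer dominates
    [3.0, 1.0, 0.0, 0.0],  # layer 3: answer dominates
    [2.0, 2.5, 0.0, 0.0],  # layer 4: filler starts to overwrite
    [1.5, 4.0, 0.0, 0.0],  # layer 5 (final): filler emitted, answer ranked second
])


def logit_lens(hidden_states: np.ndarray, unembedding: np.ndarray) -> np.ndarray:
    """Project every layer's hidden state into vocabulary logits."""
    return hidden_states @ unembedding


def rank_of(logits: np.ndarray, token_id: int) -> int:
    """1-based rank of token_id within one layer's logits (1 = top prediction)."""
    order = np.argsort(-logits)
    return int(np.where(order == token_id)[0][0]) + 1


logits = logit_lens(hidden_by_layer, unembed)
for layer, row in enumerate(logits, start=1):
    top = vocab[int(np.argmax(row))]
    print(f"layer {layer}: top={top!r}, answer rank={rank_of(row, answer_id)}")
```

The point of the rank check is the recoverability claim: even where the filler token wins the final layer, the answer can still sit just below it in the ranking.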

The caveat worth carrying away: not all tokens are filler. Reasoning models beat non-reasoning models at any inference budget because training instilled a protocol that makes their tokens *productive*; the gap is about training structure, not token count (see "Can non-reasoning models catch up with more compute?"). So self-compression works by cutting documentation overhead, not the computational substrate. The unexpected lesson: you don't need a teacher to make a model concise, because the model was already padding, but you do need to know which tokens were working and which were just talking.

Sources (6 notes)

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (2.54 match)
Break the Chain: Large Language Models Can be Shortcut Reasoners (1.74 match)
Hierarchical Reasoning Model (1.74 match)
When More is Less: Understanding Chain-of-Thought Length in LLMs (1.74 match)
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (1.72 match)
What Makes Effective Supervision in Latent Chain-of-Thought? An Information-Theoretic Analysis (1.67 match)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (0.93 match)
Measuring Faithfulness in Chain-of-Thought Reasoning (0.90 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about self-supervised reasoning compression in LLMs. The question: Can models compress reasoning chains without external teacher supervision?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023 to 2025-08. A library of arXiv work reports:
• Verbose chain-of-thought is ~92% padding by token count; compression to 7.6% of tokens preserves accuracy, suggesting most tokens are documentation, not computation (2024).
• Activation steering with 50 paired examples achieves 67% chain compression and 2.73× speedup via linearly separable regions in activation space (2025-07).
• Early-exit mechanisms on confidence signals halve compute in diffusion LLMs because answer confidence converges before reasoning refinement completes (2025-08).
• Transformers compute answers in early layers, then overwrite representations with format-compliant filler in later layers (2024-12).
• Chain-of-thought is constrained imitation of reasoning *form* rather than genuine inference; non-reasoning models cannot match reasoning models even with unlimited inference budget (2025-06, 2025-04).

Anchor papers (verify; mind their dates):
• arXiv:2507.04742 (2025-07): Activation Steering for Chain-of-Thought Compression
• arXiv:2412.04537 (2024-12): Understanding Hidden Computations in Chain-of-Thought Reasoning
• arXiv:2506.02878 (2025-06): CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate
• arXiv:2504.09858 (2025-04): Reasoning Models Can Be Effective Without Thinking

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 92% padding claim, probe whether post-2025-08 scaling, training schemes (e.g., synthetic data, RL on compression metrics), or new tokenization methods have altered the *ratio* or made padding more task-dependent. Separately: do activation-steering and early-exit methods still hold on 2025–present frontier models, or have architectural shifts (e.g., mixture-of-experts, in-context adaptation) made steering fragile? Identify which findings are durable (the question "is compression possible without supervision?") vs. perishable (the specific mechanisms and ratios).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper show that reasoning models *require* the full chain, or that compression degrades on tasks where reasoning depth matters (e.g., formal verification, multi-step planning)?
(3) Propose 2 research questions that ASSUME the regime has moved: (a) If early-exit and steering remain valid, how do they interact with long-context windows and cached intermediate states? (b) Can compression strategies *transfer* across model families and scales, or is each compression mechanism bespoke to a training protocol?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
