Can we detect redundant reasoning steps during model inference instead of training?
This explores whether the wasteful, redundant parts of a model's chain-of-thought can be spotted and cut while it's actually running — at inference — rather than being trained out beforehand.
The corpus says yes: the signal is already present in the model's own internals at inference time. The cleanest example is the PI framework, which sorts reasoning into six step types and reads the model's attention maps. Those maps show that verification and backtracking steps receive almost no downstream attention, which means the model itself is not using them. Keeping only the high-attention steps cuts reasoning length by roughly 75% without losing accuracy (see "Can reasoning steps be dynamically pruned without losing accuracy?"). So redundancy does not have to be inferred from the outside; the model's attention effectively flags which of its own steps are dead weight.
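To make the attention-based pruning idea concrete, here is a minimal sketch in plain Python. It is not the framework's actual implementation: the function names, the scoring rule (mean attention mass that later tokens place on a step), the threshold, and the toy attention matrix are all assumptions made for illustration. It also assumes attention has already been averaged over heads and layers into a single causal matrix.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # [start, end) token indices of one reasoning step


def downstream_attention(attn: Sequence[Sequence[float]], span: Span) -> float:
    """Mean attention mass that tokens after the step place on tokens inside it."""
    start, end = span
    later_rows = attn[end:]  # query positions that come after the step
    if not later_rows:
        return 0.0
    total = sum(sum(row[start:end]) for row in later_rows)
    return total / len(later_rows)


def prune_steps(attn: Sequence[Sequence[float]],
                spans: List[Span],
                threshold: float) -> List[int]:
    """Return indices of steps to keep.

    The final step is always kept: nothing follows it, so its score would be 0.
    """
    last = len(spans) - 1
    return [i for i, span in enumerate(spans)
            if i == last or downstream_attention(attn, span) >= threshold]


# Hypothetical toy data: 6 tokens, 3 steps of 2 tokens each.
# Row q holds the attention that query token q pays to tokens 0..q.
ATTN = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.3, 0.1, 0.0, 0.0, 0.0],
    [0.5, 0.3, 0.1, 0.1, 0.0, 0.0],
    [0.5, 0.4, 0.02, 0.03, 0.05, 0.0],
    [0.4, 0.4, 0.02, 0.03, 0.05, 0.1],
]
SPANS = [(0, 2), (2, 4), (4, 6)]  # step 1 plays the role of a verification step

KEPT = prune_steps(ATTN, SPANS, threshold=0.2)
print(KEPT)  # [0, 2]
```

In this toy case the middle step receives about 0.05 of later attention on average and is dropped, while the first step receives about 0.85 and is kept. A real system would need a principled threshold and a way to segment the chain into steps, neither of which this sketch addresses.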
That is not the only inference-time signal. Verbose and concise reasoning occupy distinct, linearly separable regions of the model's activation space, so a single steering vector extracted from 50 paired examples can shorten chains by about two-thirds (67%), training-free, with a 2.73x speedup (see "Can we steer reasoning toward brevity without retraining?"). A more behavioral kind of redundancy shows up too: reasoning models burn tokens by abandoning paths mid-exploration, and a decoding-only penalty on thought-switching tokens discourages that waste without any fine-tuning (see "Do reasoning models switch between ideas too frequently?" and "Why do reasoning models abandon promising solution paths?"). The common thread across all three approaches (attention maps, activation directions, decoding penalties) is that the wasteful structure is detectable from the model's runtime state, not just from training data.
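The two training-free interventions in this paragraph can be sketched in a few lines each. This is an illustration under stated assumptions, not the published methods: the difference-of-means extraction is one common way to build a steering vector and is assumed here, and the `window` parameter (how long a thought is protected from switching) is likewise an assumption. All function names, token strings, and numbers are hypothetical.

```python
from typing import Dict, List, Set

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(verbose: List[Vector], concise: List[Vector]) -> Vector:
    """Difference of means, pointing from verbose activations toward concise ones."""
    mean_verbose, mean_concise = mean_vector(verbose), mean_vector(concise)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def penalize_switches(logits: Dict[str, float],
                      switch_tokens: Set[str],
                      steps_in_thought: int,
                      window: int,
                      penalty: float) -> Dict[str, float]:
    """Lower the logits of thought-switching tokens while the current thought is young."""
    if steps_in_thought >= window:
        return dict(logits)  # thought is old enough; switching is allowed
    return {tok: (val - penalty if tok in switch_tokens else val)
            for tok, val in logits.items()}


# Hypothetical 2-dimensional activations from paired verbose/concise examples.
VERBOSE = [[1.0, 0.0], [3.0, 0.0]]
CONCISE = [[0.0, 1.0], [0.0, 3.0]]
DIRECTION = steering_vector(VERBOSE, CONCISE)
STEERED = steer([1.0, 1.0], DIRECTION, alpha=0.5)
print(DIRECTION, STEERED)  # [-2.0, 2.0] [0.0, 2.0]

# Hypothetical next-token logits early in a thought.
LOGITS = {"Alternatively": 2.0, "so": 1.5}
EARLY = penalize_switches(LOGITS, {"Alternatively"}, steps_in_thought=3,
                          window=10, penalty=3.0)
LATE = penalize_switches(LOGITS, {"Alternatively"}, steps_in_thought=12,
                         window=10, penalty=3.0)
print(EARLY, LATE)
```

Both functions operate purely on runtime state (hidden activations and next-token logits), which is the point of the paragraph above: neither needs gradient updates or access to training data.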
The reason this works points to something more interesting than efficiency: a lot of chain-of-thought simply is not doing computational work. Chain of Draft matches full verbose reasoning at 7.6% of the token cost, meaning the other 92.4% of normal reasoning tokens serve style and documentation, not computation (see "Can minimal reasoning chains match full explanations?"). Even more striking, models trained on deliberately corrupted, irrelevant traces solve problems just as well, which suggests the trace functions as computational scaffolding rather than as a sequence of meaningful logical steps (see "Do reasoning traces need to be semantically correct?"). If much of the trace is scaffolding, then "detecting redundancy" really means detecting which scaffolding the model leans on and which it ignores, and that is exactly what the attention-map approach measures.
There is a caveat worth carrying forward. Cutting steps you judge redundant assumes the remaining steps are actually load-bearing, but fine-tuning has been shown to make reasoning *performative*, where steps stop causally influencing the answer at all (see "Does fine-tuning disconnect reasoning steps from final answers?"). So inference-time pruning is in some ways safer than training-time approaches: you are reading what the model is doing right now rather than baking in an assumption about what it should do. The honest unknown is whether attention-as-importance always tracks genuine reliance, since these models reason through semantic association rather than symbolic logic (see "Do large language models reason symbolically or semantically?"); low attention might sometimes flag a step that quietly mattered. The takeaway: the model already broadcasts which of its own thoughts it is not using, and you can read that signal at inference without touching a single weight.
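One way to probe the open question above is to cross-check attention-based pruning against a causal test: remove each step in turn and see whether the final answer changes. The sketch below assumes a hypothetical `answer_fn` that maps a list of reasoning steps to a final answer; in practice that would be a model call, and here it is replaced by a toy function so the example runs on its own. The step strings and the set of attention-flagged steps are invented for illustration.

```python
from typing import Callable, List


def load_bearing_steps(steps: List[str],
                       answer_fn: Callable[[List[str]], str]) -> List[int]:
    """Indices of steps whose removal changes the final answer."""
    baseline = answer_fn(steps)
    changed = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]  # drop exactly one step
        if answer_fn(ablated) != baseline:
            changed.append(i)
    return changed


def unsafe_prunes(pruned: List[int], load_bearing: List[int]) -> List[int]:
    """Steps flagged as prunable that the ablation test says actually matter."""
    return sorted(set(pruned) & set(load_bearing))


def toy_answer_fn(steps: List[str]) -> str:
    """Stand-in for a model: sums the operands of 'add N' steps, ignores the rest."""
    return str(sum(int(s.split()[1]) for s in steps if s.startswith("add ")))


STEPS = ["add 2", "check: 2 is even", "add 3"]
LOAD_BEARING = load_bearing_steps(STEPS, toy_answer_fn)
print(LOAD_BEARING)  # [0, 2]

# Suppose attention scoring flagged step 1 as low-attention and prunable.
FLAGGED = [1]
print(unsafe_prunes(FLAGGED, LOAD_BEARING))  # []
```

An empty result means the low-attention steps were also causally inert in this toy case. A non-empty result would be an instance of the failure mode described above, where a step with low attention quietly mattered. Single-step ablation is a coarse test and would miss steps that only matter in combination.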
Sources (8 notes)
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.