Can penalizing reasoning transitions fix underthinking without fine-tuning models?
This explores whether 'underthinking' — reasoning models bailing on promising lines of thought too early — can be fixed purely at decoding time by discouraging the model from switching thoughts, without any retraining.
The corpus answers fairly directly: yes. The anchor result frames underthinking as premature thought-switching and introduces a decoding-only penalty on thought-transition tokens (the TIP strategy) that raises accuracy on hard math problems without touching model weights (see: Do reasoning models switch between ideas too frequently?). A companion diagnosis paints the fuller picture: reasoning models fail less from lack of compute than from structural disorganization. They 'wander' through invalid exploration and 'underthink' by jumping ship too soon, like tourists rather than scientists, and the same family of decoding-level interventions improves accuracy because the better solution was already within reach, just abandoned (see: Why do reasoning models abandon promising solution paths?).
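To make the idea of a decoding-only transition penalty concrete, here is a minimal sketch in plain Python. It is not the TIP implementation: the note only says that thought-transition tokens are penalized at decoding time, so the function name, the toy vocabulary, the `alpha` (penalty strength) and `beta` (window length) parameters, and the choice to lift the penalty after a window are all assumptions made for illustration.

```python
import math

def penalize_transitions(logits, transition_ids, steps_in_thought, alpha=3.0, beta=600):
    """Return a copy of `logits` with thought-transition tokens penalized.

    logits: list of floats indexed by token id.
    transition_ids: set of token ids assumed to open a new thought.
    steps_in_thought: tokens generated since the current thought began.
    alpha, beta: hypothetical penalty strength and window length.
    """
    if steps_in_thought >= beta:
        # The thought has had room to develop; switching is no longer penalized.
        return list(logits)
    return [z - alpha if i in transition_ids else z for i, z in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary (assumed): 0 = "so", 1 = "Alternatively", 2 = "therefore"
logits = [2.0, 2.0, 0.0]
early = penalize_transitions(logits, {1}, steps_in_thought=5)
late = penalize_transitions(logits, {1}, steps_in_thought=700)
print(softmax(logits)[1], softmax(early)[1], softmax(late)[1])
```

Early in a thought the probability of the switch token drops; once the assumed window has passed it returns to its original value. Model weights are never touched, which is the property the note emphasizes.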
What makes this more than a one-off trick is that penalizing transitions is one instance of a broader, surprising pattern: a lot of reasoning quality can be steered without gradient updates. Verbosity, it turns out, lives along a single linear direction in activation space: extract one vector from about 50 paired examples and you cut chain-of-thought length by two-thirds while holding accuracy, entirely training-free (see: Can we steer reasoning toward brevity without retraining?). ReBalance extends the same idea to underthinking: it reads confidence variance and overconfidence as live signals of whether the model is overthinking or underthinking, then applies steering vectors to encourage exploration exactly when the model is about to quit too early (see: Can confidence patterns reveal overthinking versus underthinking?). So the transition penalty isn't isolated. It sits alongside activation steering and confidence-based steering as decode-time levers on the same dial.
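The two steering ideas above can be sketched with toy vectors. This is a rough illustration, not either method as published: difference-of-means is a common way to extract a steering direction and is assumed here, the short lists stand in for real hidden activations, and the `diagnose` thresholds and the mapping from signals to labels (high variance read as overthinking, steady overconfidence read as underthinking) are hypothetical.

```python
from statistics import mean, pvariance

def steering_vector(pairs):
    """Mean difference between paired activations.

    pairs: list of (verbose_activation, concise_activation), each a list
    of floats with the same length. Returns the assumed 'verbosity' direction.
    """
    dim = len(pairs[0][0])
    return [mean(v[i] - c[i] for v, c in pairs) for i in range(dim)]

def steer(hidden, direction, strength):
    """Shift a hidden state along `direction`; a negative strength moves
    away from the verbose side."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def diagnose(confidences, var_hi=0.02, conf_hi=0.9):
    """Toy gate over recent per-step confidences (thresholds are hypothetical)."""
    if pvariance(confidences) > var_hi:
        return "overthinking"   # assumed: unstable confidence signals hesitation
    if mean(confidences) > conf_hi:
        return "underthinking"  # assumed: steady overconfidence signals early commitment
    return "balanced"

pairs = [([1.0, 2.0], [0.0, 2.0]), ([3.0, 1.0], [1.0, 1.0])]
direction = steering_vector(pairs)
print(direction, steer([1.0, 1.0], direction, -1.0))
print(diagnose([0.5, 0.9, 0.3, 0.95]), diagnose([0.95, 0.96, 0.97]))
```

In a real model the vectors would come from a chosen layer's activations and the shift would be applied during generation; which layer and what strength to use are tuning decisions this sketch does not address.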
The reason these levers work at all connects to a deeper claim in the corpus: post-training mostly selects reasoning that base models already latently contain, rather than installing it. Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit capability already present in activations, which reframes the bottleneck as elicitation, not acquisition (see: Do base models already contain hidden reasoning ability?). If the good reasoning path already exists and the failure is that the model walks away from it, then a decoding nudge to stay the course is exactly the right-sized fix: you don't need to teach anything new.
There's a useful tension worth knowing about, though. Underthinking is only one side of the coin; its opposite, overthinking, is just as real, and the cure for one can aggravate the other. Accuracy follows an inverted U in chain-of-thought length: it peaks at intermediate length and then declines, with more capable models preferring shorter chains (see: Why does chain of thought accuracy eventually decline with length?). Reasoning models also overthink ill-posed questions, churning out redundant steps because training taught them to keep reasoning but never taught them when to disengage (see: Why do reasoning models overthink ill-posed questions?). A blunt transition penalty that says 'never switch' risks trapping the model on a genuinely wrong path. That's why the confidence-aware approaches matter, and why some researchers go the training route instead, letting models learn to route between extended thinking and quick answers, as in Thinkless (see: Can models learn when to think versus respond quickly?).
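One hypothetical way to soften a blunt 'never switch' rule is to scale the switch penalty by how confident the model is in its current path, so that a path the model itself doubts can still be abandoned. The sketch below is not taken from any of the cited notes; the function name, the linear scaling, and the `floor` parameter are all assumptions chosen only to illustrate the trade-off.

```python
def adaptive_penalty(base_alpha, path_confidence, floor=0.0):
    """Scale a switch penalty by confidence in the current path.

    base_alpha: maximum penalty applied to thought-transition logits.
    path_confidence: assumed score in [0, 1]; values outside are clamped.
    floor: minimum penalty kept even when confidence is zero.
    """
    c = min(1.0, max(0.0, path_confidence))
    return max(floor, base_alpha * c)

# High confidence keeps the full penalty; low confidence relaxes it.
for conf in (1.0, 0.5, 0.0):
    print(conf, adaptive_penalty(3.0, conf))
```

The resulting value would be subtracted from the logits of transition tokens in place of a fixed constant.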
One quiet argument in favor of staying training-free: fine-tuning has a documented cost to reasoning faithfulness. After fine-tuning, chains of thought less reliably drive the final answer. Early termination, paraphrasing, and filler substitution all leave the answer unchanged more often, suggesting the reasoning becomes performative rather than functional (see: Does fine-tuning disconnect reasoning steps from final answers?). So fixing underthinking with a decode-time penalty isn't just cheaper than retraining; it also sidesteps a way that retraining can quietly hollow out the very reasoning you were trying to improve.
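The early-termination test can be sketched as an invariance rate: truncate the chain of thought at each step and count how often the final answer stays the same. This is a simplified illustration, not the cited study's protocol. `answer_fn` is a hypothetical callable standing in for a model that answers from a question plus a (possibly truncated) list of reasoning steps, and the two toy models below are invented to show the extremes.

```python
def early_termination_invariance(answer_fn, question, steps):
    """Fraction of truncation points at which the answer is unchanged.

    answer_fn(question, steps) -> answer; hypothetical stand-in for a model.
    A rate near 1.0 suggests the reasoning barely influences the answer.
    """
    if not steps:
        return 0.0
    full_answer = answer_fn(question, steps)
    unchanged = sum(
        answer_fn(question, steps[:k]) == full_answer
        for k in range(len(steps))  # keep 0, 1, ..., n-1 steps
    )
    return unchanged / len(steps)

# Toy models: one ignores its reasoning, one answers from its last step.
performative = lambda question, steps: "42"
functional = lambda question, steps: steps[-1] if steps else "?"

chain = ["a", "b", "c"]
print(early_termination_invariance(performative, "q", chain))
print(early_termination_invariance(functional, "q", chain))
```

Comparing this rate before and after fine-tuning is the kind of measurement the paragraph describes: a higher rate after fine-tuning means the answer depends less on the reasoning.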
Sources (9 notes)
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, suggesting that brevity emerges from reward signals rather than from being explicitly trained for.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.