Why do different model training approaches produce different overthinking thresholds?
This explores why the point at which a model starts 'overthinking' — burning extra reasoning tokens that hurt rather than help — depends on how it was trained, not just how much it thinks.
The threshold for harmful overthinking shifts with the training approach: where one model peaks and then degrades, another holds steady, because training shapes *how* a model uses its reasoning, not just how much reasoning it does. The corpus suggests overthinking isn't a fixed property of long reasoning; it is a downstream symptom of what the training objective rewarded.
The clearest evidence that training, rather than token count alone, mediates the threshold comes from work showing the same mechanism can flip from harmful to helpful. In vanilla models, extended thinking induces counterproductive self-doubt that degrades answers; RL training reverses this, turning the identical 'thinking mode' into productive gap analysis (Does extended thinking help or hurt model reasoning?). So the threshold moves because training changes the *content* of the extra tokens, not their quantity. Meanwhile the raw phenomenon is real and sharp: accuracy can peak at a critical token count and then fall off a cliff, from 87.3% to 70.3% as thinking tokens scale from 1,100 to 16,000, with the extra length adding self-revision errors rather than better answers (When does thinking too much actually hurt reasoning?). Two models trained differently will hit that cliff at different places.
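To make "hitting the cliff at different places" concrete, here is a minimal sketch of how one might read a threshold off an accuracy-versus-token-budget curve. The function name `overthinking_threshold` and both curves are hypothetical: the numbers are synthetic placeholders chosen for illustration, not measurements from any of the cited notes.

```python
from typing import Sequence, Tuple


def overthinking_threshold(
    budgets: Sequence[int], accuracies: Sequence[float]
) -> Tuple[int, float]:
    """Return (token budget at peak accuracy, accuracy lost from peak to largest budget)."""
    if not budgets or len(budgets) != len(accuracies):
        raise ValueError("budgets and accuracies must be non-empty and the same length")
    # Sort by budget so the last pair is the largest thinking budget.
    pairs = sorted(zip(budgets, accuracies))
    # max() keeps the first maximum, i.e. the smallest budget that reaches the peak.
    peak_budget, peak_acc = max(pairs, key=lambda pair: pair[1])
    final_acc = pairs[-1][1]
    return peak_budget, peak_acc - final_acc


# Synthetic placeholder curves for two hypothetical models (NOT real results).
budgets = [512, 1024, 2048, 4096]
model_a_acc = [0.70, 0.80, 0.74, 0.62]  # peaks early, then degrades
model_b_acc = [0.68, 0.76, 0.81, 0.80]  # peaks later, then holds roughly steady

threshold_a, drop_a = overthinking_threshold(budgets, model_a_acc)
threshold_b, drop_b = overthinking_threshold(budgets, model_b_acc)
print(threshold_a, round(drop_a, 2))
print(threshold_b, round(drop_b, 2))
```

Taking the peak of the curve is only one possible definition of the threshold; a noisier evaluation would probably want smoothing or confidence intervals before declaring a peak.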
Much of the variation traces to what the reward signal taught the model to do at the margins. Reasoning models are optimized to *produce* reasoning steps but are never taught *when to disengage*, so they generate redundant chains even for ill-posed questions that a non-reasoning model would simply reject (Why do reasoning models overthink ill-posed questions?). Reward design pushes this further: binary correctness rewards incentivize confident guessing and miscalibration (Does binary reward training hurt model calibration?), and training on near-impossible RLVR samples teaches degenerate shortcuts (answer repetition, computation-skipping) that contaminate genuine reasoning (Do overly hard RLVR samples actually harm model capabilities?). Each of these training choices nudges the overthinking threshold in a different direction.
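The calibration point can be sketched in a few lines. The note on binary rewards describes adding the Brier score as a second reward term; the version below assumes the simplest combination, subtracting the squared gap between stated confidence and outcome from the correctness reward with equal weight. The function names and the unweighted form are illustrative assumptions, not a confirmed implementation.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores how confident the model claimed to be."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty (assumed unweighted combination)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


# Under the binary reward, a wrong answer scores the same whatever the confidence.
# With the Brier term, a confident wrong answer is punished hardest.
for confidence in (0.0, 0.5, 1.0):
    print(confidence, binary_reward(False), calibrated_reward(False, confidence))
```

With the binary reward alone, a wrong answer costs nothing extra when it is stated with full confidence, which is the incentive toward confident guessing described above.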
What you may not expect: the corpus suggests the threshold is often *latent in the base model* and merely surfaced or suppressed by post-training. Five independent methods all elicit reasoning that already lives in base-model activations, so post-training selects reasoning rather than creating it (Do base models already contain hidden reasoning ability?). That's why so many fixes work *without* retraining at all: verbose and concise chains occupy distinct, linearly steerable regions of activation space (Can we steer reasoning toward brevity without retraining?), confidence variance can diagnose over- versus under-thinking and steer between them (Can confidence patterns reveal overthinking versus underthinking?), and a decoding penalty on thought-switching curbs the *opposite* failure, underthinking, where models abandon paths too early (Do reasoning models switch between ideas too frequently?). Overthinking and underthinking are two ends of one dial that training sets.
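As a rough illustration of what "linearly steerable" means, the sketch below builds a steering direction as the difference between mean activations on concise and verbose examples, then nudges a hidden state along it. Difference-of-means is a common way to construct such vectors and is assumed here for illustration; it is not necessarily the exact procedure of the cited method, and the tiny 2-dimensional activations are toy values rather than real model states.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(
    verbose_acts: Sequence[Sequence[float]], concise_acts: Sequence[Sequence[float]]
) -> Vector:
    """Direction pointing from the verbose mean toward the concise mean."""
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Sequence[float], direction: Sequence[float], strength: float) -> Vector:
    """Shift a hidden state along the steering direction, scaled by strength."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy paired activations (hypothetical values, 2 dimensions for readability).
verbose_acts = [[1.0, 0.0], [3.0, 0.0]]
concise_acts = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(verbose_acts, concise_acts)
steered = steer([2.0, 0.0], direction, strength=0.5)
print(direction, steered)
```

In a real model the shift would be applied to a chosen layer's activations at inference time, with the strength tuned so that chains get shorter without accuracy falling.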
The deeper takeaway is that this failure mode isn't unique to token-level reasoning. Iterative refinement at the response level reproduces the same failure architecture, accumulating noise without guaranteed improvement (Do iterative refinement methods suffer from overthinking?), and search-based research agents follow the same diminishing-returns scaling curve as reasoning tokens (Do search steps follow the same scaling rules as reasoning tokens?). So 'why do different training approaches produce different thresholds' generalizes: any process that rewards more-of-an-action without rewarding when-to-stop will set its overthinking cliff wherever the training signal left the off-switch.
Sources (11 notes)
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.