Why does convergence stability sometimes mislead about reasoning correctness?
This explores why a reasoning model's *confidence* — settling smoothly on an answer and not wavering — can be a false signal of correctness, and what better signals the corpus offers.
The short version from the corpus: convergence stability (a model locking onto an answer cleanly, without thrashing) and correctness come apart, and several notes show that the *smooth-looking* trace is often the suspect one. The clearest case is premature confidence. Models that commit to an answer early and then spend the rest of the chain rationalizing it look serenely convergent, but that early lock-in is itself a measurable signal of *flawed* reasoning. Rewarding gradual confidence growth via RL, instead of early certainty, improves accuracy substantially (by 42 percentage points on Countdown) without process labels or external reward models (see "Can confidence trajectories reveal when reasoning goes wrong?"). So the very smoothness you might read as 'the model is sure' is sometimes the fingerprint of a model that guessed first and reasoned backward.
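As a rough illustration of what "early lock-in" could look like in numbers, here is a minimal sketch that locates the point where a confidence trajectory first crosses a threshold. It assumes you already have one confidence value per reasoning step; the function names, the 0.9 threshold, and the 0.5 cutoff are illustrative assumptions, not the method or settings used in the cited note.

```python
from typing import Sequence


def lock_in_position(confidences: Sequence[float], threshold: float = 0.9) -> float:
    """Relative position (0.0 = first step, 1.0 = last step) of the first
    step whose confidence reaches `threshold`.

    Returns 1.0 if the threshold is never reached. Both the threshold and
    this convention are assumptions made for the sketch.
    """
    if not confidences:
        raise ValueError("confidences must be non-empty")
    n = len(confidences)
    for i, c in enumerate(confidences):
        if c >= threshold:
            return i / (n - 1) if n > 1 else 0.0
    return 1.0


def is_premature(confidences: Sequence[float],
                 threshold: float = 0.9,
                 earliest_ok: float = 0.5) -> bool:
    """Flag traces that become confident before `earliest_ok` of the chain."""
    return lock_in_position(confidences, threshold) < earliest_ok


# Hypothetical trajectories: one commits immediately, one builds up gradually.
early = [0.95, 0.96, 0.97, 0.97, 0.98]
gradual = [0.30, 0.45, 0.60, 0.80, 0.95]

print(is_premature(early), is_premature(gradual))
```

Both example traces end up equally confident, so a check on the final confidence alone would not separate them; only the position of the crossing does.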
The deeper reason stability deceives is that fluent, coherent reasoning is often pattern-matched structure rather than genuine inference. Chain-of-thought guides models to imitate the *shape* of reasoning, which is why structural coherence can be high while content correctness is low, and why optimizing for performance works against interpretability (see "Why does chain-of-thought reasoning fail in predictable ways?"). A model can therefore converge confidently on familiar-looking terrain and still collapse the moment the problem's structure is unfamiliar: DeepSeek-R1 and o1-preview, which sound reflective and fluent, reach only 20-23.6% exact match on 850 constraint-satisfaction problems that demand real backtracking (see "Can reasoning models actually sustain long-chain reflection?"). Stable-sounding reflection is not the same as competent problem-solving.
Fine-tuning makes this worse in a sneaky way: it can *increase* apparent stability while *weakening* the link between reasoning and answer. Three faithfulness tests show that after fine-tuning, answers more often stay invariant when you terminate the reasoning early, paraphrase it, or substitute filler for it, which suggests the chain has become decorative rather than load-bearing (see "Does fine-tuning disconnect reasoning steps from final answers?"). Similarly, supervised fine-tuning teaches models to emit outputs that *look* right (clean JSON, valid identifiers, expected sections) without making them physically feasible or constraint-satisfying (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). The output converges to a confident, well-formed surface that the reasoning never actually earned.
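The invariance idea behind those faithfulness tests can be sketched as a small harness: perturb the reasoning, re-ask for the answer, and record whether the answer moved. Everything here is a hypothetical stand-in: `answer_fn` represents whatever call produces a final answer from a question plus reasoning steps, the two toy "models" exist only to make the example runnable, and paraphrasing is omitted because it would need a real rewriting model.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate_half(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the chain."""
    return steps[: len(steps) // 2]


def filler(steps: List[str]) -> List[str]:
    """Filler substitution: replace every step with a content-free token."""
    return ["..." for _ in steps]


def invariance_report(question: str, steps: List[str], answer_fn: AnswerFn,
                      perturbations: Dict[str, Perturbation]) -> Dict[str, bool]:
    """For each perturbation, True means the answer did NOT change."""
    baseline = answer_fn(question, steps)
    return {name: answer_fn(question, perturb(steps)) == baseline
            for name, perturb in perturbations.items()}


# Toy stand-ins, not real models.
def decorative_model(question: str, steps: List[str]) -> str:
    return "42"  # ignores the reasoning entirely


def load_bearing_model(question: str, steps: List[str]) -> str:
    # Reads its answer off the last reasoning step.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


tests = {"early_termination": truncate_half, "filler": filler}
chain = ["3 * 4 = 12", "12 + 30 = 42"]

decorative = invariance_report("What is 3 * 4 + 30?", chain, decorative_model, tests)
load_bearing = invariance_report("What is 3 * 4 + 30?", chain, load_bearing_model, tests)
print(decorative, load_bearing)
```

A chain that is doing real work should fail these invariance checks; a report full of `True` values is the warning sign, even though it looks like stability.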
There's an interesting flip side: instability isn't automatically bad, and trajectory problems run in both directions. Traces that keep exploring *after* the answer is settled poison fine-tuning: post-conclusion wandering degrades learning even when the answer stays correct (see "Does every correct chain-of-thought trace improve fine-tuning?"). Meanwhile, models that switch paths too early ('underthinking') abandon viable solution paths before finishing them (see "Do reasoning models switch between ideas too frequently?" and "Why do reasoning models abandon promising solution paths?"). So both premature convergence and premature divergence are failure modes; the trajectory's *smoothness* tells you almost nothing on its own.
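To make the decoding-level remedy for underthinking concrete, here is a minimal sketch of a thought-switching penalty applied to next-token scores. The source notes say only that the penalty targets thought-transition tokens at decoding time, so the token set, the penalty size, the `duration` window, and the dictionary representation of logits are all assumptions chosen for illustration.

```python
from typing import Dict, Set


def penalize_switch_tokens(logits: Dict[str, float],
                           switch_tokens: Set[str],
                           steps_since_thought_start: int,
                           penalty: float = 3.0,
                           duration: int = 50) -> Dict[str, float]:
    """Lower the scores of thought-transition tokens while the current
    thought is still young, so the model is nudged to develop it further.

    `penalty` and `duration` are hypothetical values, not tuned settings.
    """
    if steps_since_thought_start >= duration:
        return dict(logits)  # the thought has had its chance; no penalty
    return {tok: (score - penalty if tok in switch_tokens else score)
            for tok, score in logits.items()}


# Hypothetical next-token scores shortly after a new thought begins.
raw = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switchers = {"Alternatively"}

young = penalize_switch_tokens(raw, switchers, steps_since_thought_start=5)
mature = penalize_switch_tokens(raw, switchers, steps_since_thought_start=80)

print(max(young, key=young.get), max(mature, key=mature.get))
```

The penalty only reshapes sampling and leaves the model weights alone, which matches the notes' point that such interventions work without fine-tuning.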
What actually discriminates good reasoning from confident-but-wrong reasoning is looking *locally* instead of globally. Step-level confidence catches breakdowns that global averaging smooths over: averaging the whole trace masks the one step where it quietly went off the rails, while per-step signals surface it and even allow stopping before the trace completes (see "Does step-level confidence outperform global averaging for trace filtering?"). That's the real lesson: aggregate stability is a low-resolution view that hides the moment things broke. If you want to know whether convergence is trustworthy, watch *how* confidence grew step by step, not whether the final answer sat still.
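The gap between a global average and a local view is easy to show with a small sketch. It assumes per-step confidence scores are available and uses a sliding-window mean as the local signal; the window size, the 0.5 floor, and the function names are illustrative assumptions, not the cited note's exact procedure.

```python
from typing import List, Optional, Sequence


def global_confidence(confidences: Sequence[float]) -> float:
    """Mean confidence over the whole trace."""
    return sum(confidences) / len(confidences)


def window_means(confidences: Sequence[float], window: int) -> List[float]:
    """Mean confidence of every contiguous run of `window` steps."""
    if window < 1 or window > len(confidences):
        raise ValueError("window must be between 1 and len(confidences)")
    return [sum(confidences[i - window + 1: i + 1]) / window
            for i in range(window - 1, len(confidences))]


def min_window_confidence(confidences: Sequence[float], window: int = 2) -> float:
    """The weakest local stretch of the trace."""
    return min(window_means(confidences, window))


def first_breakdown(confidences: Sequence[float], window: int = 2,
                    floor: float = 0.5) -> Optional[int]:
    """Index of the step at which a window mean first drops below `floor`,
    or None. Generation could be stopped at this step."""
    for offset, mean in enumerate(window_means(confidences, window)):
        if mean < floor:
            return offset + window - 1
    return None


# Hypothetical traces with the same global mean (0.78).
steady = [0.78] * 10
dipped = [0.9, 0.9, 0.9, 0.9, 0.3, 0.3, 0.9, 0.9, 0.9, 0.9]

print(global_confidence(steady), global_confidence(dipped))
print(min_window_confidence(steady), min_window_confidence(dipped))
print(first_breakdown(steady), first_breakdown(dipped))
```

The two traces are indistinguishable by their global mean, yet only the second contains a two-step collapse, which the windowed view exposes and locates.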
Sources (9 notes)
Models that commit to answers early then rationalize show measurable flawed reasoning. Rewarding gradual confidence growth via RL improves accuracy significantly—on Countdown by 42 percentage points—without needing process labels or external reward models.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
Post-conclusion reasoning—where the model keeps exploring after sufficient evidence for the answer—degrades supervised fine-tuning despite preserving correctness. Removing only this tail improves learning more than removing equally-long random suffixes, proving the harm comes from unnecessary exploration, not length.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.