Understanding and Mitigating Premature Confidence for Better LLM Reasoning

Paper · arXiv 2605.24396 · Published May 23, 2026
Reinforcement Learning

Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce. We find an annotation-free signal of reasoning quality in how the model’s confidence evolves during reasoning: premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. We exploit this in progressive confidence shaping, a reinforcement learning objective that trains models to update their confidence as they reason rather than commit early. It rewards gradual confidence growth and penalizes early commitment, with no external labels or reward models. The method improves accuracy and reasoning quality for models from 1.5B to 8B parameters across arithmetic (Countdown), math (DAPO, AIME), and science (ScienceQA): on Countdown, accuracy improves 3.2× (+42.0pp) and flawed reasoning drops 48pp; on AIME, Pass@64 improves 6.6pp.
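
To make the idea of rewarding gradual confidence growth and penalizing early commitment concrete, the following is a minimal illustrative sketch, not the paper's actual objective. It assumes the model's confidence in its final answer has already been measured at evenly spaced checkpoints along the reasoning trace. The function names (`early_commitment`, `gradual_growth`, `shaped_reward`), the weights `penalty` and `bonus`, and the specific formulas are all hypothetical choices made for illustration.

```python
from typing import Sequence


def early_commitment(conf: Sequence[float], early_frac: float = 0.5) -> float:
    """Mean confidence over the first `early_frac` of checkpoints.

    Hypothetical proxy for premature confidence: a high value means the
    model was already confident before most of the reasoning was written.
    """
    if not conf:
        raise ValueError("conf must be non-empty")
    k = max(1, int(len(conf) * early_frac))
    return sum(conf[:k]) / k


def gradual_growth(conf: Sequence[float]) -> float:
    """Fraction of consecutive checkpoint pairs where confidence strictly rises.

    Hypothetical proxy for gradual confidence growth.
    """
    if len(conf) < 2:
        return 0.0
    rises = sum(1 for a, b in zip(conf, conf[1:]) if b > a)
    return rises / (len(conf) - 1)


def shaped_reward(
    outcome: float,
    conf: Sequence[float],
    penalty: float = 0.5,  # assumed weight on early commitment
    bonus: float = 0.5,    # assumed weight on gradual growth
) -> float:
    """Outcome reward plus a growth bonus minus an early-commitment penalty."""
    return outcome + bonus * gradual_growth(conf) - penalty * early_commitment(conf)


# Two correct traces (outcome = 1.0) with different confidence trajectories.
premature = [0.90, 0.95, 0.95, 0.97]  # confident almost immediately
gradual = [0.10, 0.30, 0.60, 0.90]    # confidence builds as reasoning proceeds

print(shaped_reward(1.0, premature))
print(shaped_reward(1.0, gradual))
```

Under these assumed definitions, the trace whose confidence builds gradually receives a higher shaped reward than the one that commits early, even though both reach a correct answer. No external labels or reward models are needed beyond the outcome reward and the model's own confidence estimates.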

Introduction. Chain-of-thought (CoT) reasoning (Wei et al., 2022) has driven much of the recent progress on hard reasoning tasks (Cobbe et al., 2021; Hendrycks et al., 2021; Suzgun et al., 2023), both through prompting (Wei et al., 2022; Kojima et al., 2022) and reinforcement learning (Jaech et al., 2024; Guo et al., 2025; Yang et al., 2025). Yet long CoTs frequently contain logical gaps, unjustified leaps, and contradictions, and the extra reasoning tokens often fail to deliver the capability gains they should (Sprague et al., 2025). Improving reasoning quality directly would require process reward models that score intermediate steps (Lightman et al., 2024; Uesato et al., 2022; Wang et al., 2024), but the step-level annotations needed to train them are expensive and scarce. As a result, RL on reasoning has largely relied on outcome rewards (Shao et al., 2024; Yu et al., 2026), which improve answers without examining how they were reached.

Discussion / Conclusion. We propose premature confidence, the phenomenon where a model commits to an answer before completing its reasoning chain, as a scalable, annotation-free metric for detecting low-quality CoT. We show that premature confidence strongly correlates with the number of flaws in the reasoning trace, validating it as a quantitative indicator of post-hoc rationalization. Building on this metric, we introduce progressive confidence shaping, which penalizes prematurely confident reasoning during RL training. Experiments on Countdown, DAPO, AIME, and SciQA demonstrate that our method reduces reasoning flaws while maintaining or improving task accuracy. Finally, we identify two mechanistic factors, reasoning utility and reasoning accessibility, that jointly govern premature confidence, and show how task difficulty and model size modulate their interplay.