Does extended thinking help or hurt model reasoning?
Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.
The proactive critical thinking experiments reveal a striking interaction between training and inference-time reasoning. For vanilla (off-the-shelf) models, activating "thinking mode" — the extended internal reasoning chains used by models like Qwen3 — actually degrades performance on proactive critical thinking tasks. The extended thinking "appears to induce counterproductive self-doubt rather than useful analysis, leading to a clear drop in performance."
But after RL training on proactive critical thinking tasks, the same thinking mode becomes beneficial. Training changes how models use their internal reasoning. The difference is not the amount of thinking but what the thinking is directed toward: self-doubt before training, gap analysis after.
The finding connects to several established insights but adds a distinct mechanism:
The note "Does RL teach reasoning or just when to use it?" argues that RL manages the timing of reasoning. The proactive thinking result extends this: RL also manages the mode of reasoning, redirecting extended thinking from unproductive self-doubt toward productive gap analysis.
The SFT finding adds nuance: when SFT data is self-generated by the model, it "does not inherently enhance its capabilities" and may reduce output entropy, constraining the subsequent RL phase. This echoes the note "Does policy entropy collapse limit reasoning performance in RL?": SFT-then-RL may face the same entropy collapse that pure RL faces, but through a different mechanism (entropy reduction from self-generated imitation rather than RL convergence).
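The entropy-reduction claim above can be made concrete with a toy calculation. The sketch below is not from the source: it computes Shannon entropy for a hypothetical four-token output distribution and for a sharpened copy of it. The `sharpen` helper and the example probabilities are assumptions chosen purely for illustration, standing in loosely for a model that has been fine-tuned toward its own most likely outputs.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(probs, power):
    """Raise each probability to `power` and renormalize.

    A power greater than 1 concentrates mass on the already-likely outcomes.
    This is a toy proxy for self-imitation, not the source's training method.
    """
    weights = [p ** power for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical next-token distribution (illustrative values only).
base = [0.4, 0.3, 0.2, 0.1]
sharpened = sharpen(base, 2.0)

print(f"entropy before sharpening: {entropy(base):.3f} nats")
print(f"entropy after sharpening:  {entropy(sharpened):.3f} nats")
```

The sharpened distribution has lower entropy than the original, so sampling from it yields fewer distinct continuations. That is the sense in which a low-entropy starting policy may leave a subsequent RL phase with less to explore.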
The practical implication: extended thinking is not a universal good. It is a resource that can be directed productively or destructively, and the direction depends on training. "More thinking" applied to a model without the right training signal may systematically make things worse.
Inquiring lines that use this note as a source (118)
This note is a source for these synthesized inquiries.
- What makes emotional alignment more effective than logic when reasoning errors are exposed?
- Do spurious rewards activate reasoning without teaching new skills?
- What distinguishes genuine reasoning activation from memorization-assisted answer recall?
- When should action deliberation trigger during reasoning steps?
- What makes training-free approaches like Soft Thinking preferable to SoftCoT?
- Do high-influence thoughts align with SAND deliberation triggers?
- Can proactive critical thinking alone enable models to request clarification effectively?
- Why must procedural skills consolidate before strategic reasoning can develop?
- Does training for better reasoning reduce an AI system's ability to abstain?
- What makes reasoning capability a pre-training rather than post-training phenomenon?
- Do explicit reasoning chains improve or harm performance on complex judgment tasks?
- Can extended thinking genuinely improve reasoning or just increase variance?
- Why do more capable models prefer shorter chains of thought?
- Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?
- Does explicit reasoning help or hurt tasks requiring continuous nuanced judgment?
- When should an LLM engage extended reasoning versus responding directly?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- Why does early intervention matter more than late intervention in knowledge collapse?
- What triggers overthinking versus underthinking in reasoning models?
- Can parallel thinking outperform sequential thinking under the same token budget?
- Why does reasoning accuracy degrade beyond a critical thinking token threshold?
- How does difficulty level change whether extended thinking provides genuine reasoning signal?
- How do thinking tokens function as mutual information peaks in reasoning?
- When does explicit reasoning actually degrade performance on a task?
- Can extended reasoning training capture individual strategic thinking styles?
- Can proactive critical thinking train models to request clarification actively?
- Why does step-by-step reasoning degrade performance on judgment-based tasks?
- Do reasoning models trade instruction following for deliberative capability?
- Why do human-curated thought examples fail to improve model thinking?
- How can judges evaluate thinking without seeing the actual thoughts?
- What role does confidence play in balancing overthinking versus underthinking?
- Why might latent reasoning capture types of thinking that verbalized CoT cannot?
- Does reasoning structure match explicit versus implicit task demands?
- How does evaluation format change what we measure about model reasoning?
- Why does extended reasoning fail for search and knowledge retrieval tasks?
- Are traditional cognitive theories missing interaction effects between mechanisms?
- Why does reasoning effort fail to improve theory of mind performance?
- Does distillation from reasoning models spread overthinking to smaller models?
- How does proactive critical thinking enable models to identify missing information?
- Why does extended thinking increase output variance without improving reasoning quality?
- Does deep-thinking ratio measure computational effort better than chain-of-thought length?
- Do reasoning models become more vulnerable to persona-induced bias than standard models?
- Can extended deliberation in agents become counterproductive like human overthinking?
- Does thought consolidation address the confirmatory reflection problem in reasoning models?
- Why does inference-time thinking hurt proactive critical thinking in vanilla models?
- Can models distinguish between activated knowledge and genuine reasoning?
- What happens to reasoning accuracy when models use more thinking tokens?
- Does formal reasoning training actively degrade social reasoning ability?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- What three factors actually drive chain of thought performance improvements?
- What distinguishes redundant cycles from productive reconsidering cycles?
- Why does revision often make reasoning accuracy worse in frontier models?
- What distinguishes coherent reasoning from inaccurate but plausible predictions?
- Can RL training teach models when to activate reasoning versus when to skip it?
- How do reasoning training methods sacrifice some thinking skills while improving others?
- How do reward models benefit from extended thinking during evaluation scoring?
- Can activation-space steering vectors replicate thinking model performance without retraining?
- Can extended RL training unlock genuinely new reasoning strategies models cannot discover otherwise?
- Do longer reasoning traces actually improve theory of mind accuracy?
- Does explicit reasoning help or hurt tasks requiring continuous judgment?
- Why does latent reasoning override no-think instructions in models?
- What other triggers can activate the latent reasoning capability?
- How does extended thinking affect variance in reasoning model outputs?
- When should a system choose extended thinking versus quick responses?
- How should timing for reasoning intervention be determined during inference?
- How does collaboration itself become a degradation mechanism in reasoning tasks?
- Does RL training actually restore the critical thinking that reasoning models lose?
- Does internal self-revision actually degrade reasoning accuracy in models?
- Does chain-of-thought reasoning help or hurt social reasoning tasks?
- Do shorter reasoning chains maintain instruction adherence better than longer ones?
- Does the thinking box provide genuine reasoning or just token budget?
- How do emotional and social simulations enable better hypothetical reasoning?
- Do reasoning failures stem from strategy or from calculation breakdown?
- How much does extended thinking actually improve model reasoning ability?
- Does penalizing thought transitions improve reasoning without model retraining?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- Can reasoning scaffolds help with nuanced judgment tasks like empathy?
- Why might social reasoning work differently than formal logical reasoning?
- Do extended thinking blocks access latent empathetic capabilities in models?
- Can thinking token density explain reasoning performance beyond total length?
- Can extended thinking modes introduce genuine rhetorical exploration to LLMs?
- Does reasoning training actively undermine the abstention capacity safety training created?
- Can models overthink and underthink at the same time?
- Why do different model training approaches produce different overthinking thresholds?
- Does more thinking always improve language model accuracy?
- Does task difficulty alone determine how many thinking tokens a model should use?
- What happens to model reasoning accuracy as thinking token requirements exceed critical thresholds?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- What is the distinction between teaching reasoning how versus when to activate?
- Can pretraining signals unlock latent reasoning that post-training merely activates?
- Why does reasoning volume fail to improve theory of mind performance?
- What distinguishes reasoning activation mechanisms across different training methods?
- What causes reasoning quality to degrade during long research tasks?
- Can benchmark improvements hide degradation of deliberative reasoning?
- Can thought quality alone be trusted to guide model training?
- Can a single model implement fast thinking, slow thinking, and tool use?
- How do timing and search internalization interact during reasoning post-training?
- Can conditioning generation on difficulty probes reduce overthinking on simple tasks?
- Does performative reasoning mask underlying uncertainty even on easy problems?
- Can format adaptation alone explain why reasoning enrichment improves instruction following?
- What training interventions could close the perception-action gap?
- Why does parallel thinking outperform sequential thinking under fixed token budgets?
- Why do longer reasoning chains explore like tourists instead of scientists?
- Can activation steering compress reasoning without retraining models?
- Why do thinking models execute longer tasks than standard language models?
- How does active reasoning through interaction differ from passive single-turn problem solving?
- Can structured questioning prompts improve reasoning beyond standard conversational training?
- What distinguishes metacognitive regulation from standard chain-of-thought reasoning?
- Can reasoning training fix sycophancy if it is not a reasoning failure?
- Why does pre-training provide the raw material for emergent thinking?
- How do thought actions represent policy improvement steps in practice?
- What role does task structure play in rewarding delayed thinking?
- Can we predict when a model will develop thinking behaviors?
- Why does extended reasoning training improve exploration without adding new capabilities?
- Is premature decision-making a form of underthinking in transformer models?
- Why does reflection in reasoning models often become theater rather than genuine thought?
- How does o1-style reasoning relate to learned search processes versus memorized solutions?
- Does targeting the edge of competence during RL pretraining unlock true reasoning gains?
Related concepts in this collection (4)
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  RL manages timing; this paper shows RL also manages quality direction of reasoning
- Can models learn when to think versus respond quickly?
  Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
  DeGRPO mode selection; proactive thinking adds a training-mediated quality dimension
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  SFT-then-RL may face entropy collapse through self-generated imitation
- What critical thinking skills do reasoning models actually lose?
  Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.
  the thinking-mode reversal is a specific instance of the broader critical thinking problem: reasoning training optimizes one narrow type of thinking while degrading others; the proactive thinking result shows RL can selectively repair one form of degradation (self-doubt → gap analysis) while the critical thinking post documents the broader pattern
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Base Models Know How to Reason, Thinking Models Learn When
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- Eliciting Reasoning in Language Models with Cognitive Tools
Original note title
rl training transforms thinking mode from counterproductive self-doubt into beneficial proactive analysis — the same mechanism helps or hurts depending on training