SYNTHESIS NOTE

Does extended thinking help or hurt model reasoning?

Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.

Synthesis note · 2026-02-22 · sourced from Conversation Agents

The proactive critical thinking experiments reveal a striking interaction between training and inference-time reasoning. For vanilla (off-the-shelf) models, activating "thinking mode" — the extended internal reasoning chains used by models like Qwen3 — actually degrades performance on proactive critical thinking tasks. The extended thinking "appears to induce counterproductive self-doubt rather than useful analysis, leading to a clear drop in performance."

But after RL training on proactive critical thinking tasks, the same thinking mode becomes beneficial. Training fundamentally changes how models use their internal reasoning. The difference is not the amount of thinking but its direction: whether the extra reasoning is spent on self-doubt or on useful analysis.
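The reversal described above amounts to a sign flip in the effect of thinking mode between the vanilla and RL-trained conditions. The following is a minimal sketch of how one might check for that pattern in evaluation scores. The names `ConditionScores` and `is_reversal` are hypothetical, and the accuracy values are placeholders chosen for illustration, not results from the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionScores:
    """Accuracy of one model variant with thinking mode off and on."""
    thinking_off: float
    thinking_on: float

    @property
    def thinking_effect(self) -> float:
        # Positive: thinking mode helps. Negative: thinking mode hurts.
        return self.thinking_on - self.thinking_off


def is_reversal(vanilla: ConditionScores, trained: ConditionScores) -> bool:
    """True when thinking hurts the vanilla model but helps the trained one."""
    return vanilla.thinking_effect < 0 < trained.thinking_effect


# Placeholder numbers for illustration only; NOT results from the experiments.
vanilla = ConditionScores(thinking_off=0.60, thinking_on=0.50)
rl_trained = ConditionScores(thinking_off=0.70, thinking_on=0.80)

print(is_reversal(vanilla, rl_trained))
```

Comparing the per-condition difference, rather than raw accuracy, keeps the question "does thinking help?" separate from "is the trained model better overall?".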

The finding connects to several established insights but adds a distinct mechanism:

As discussed in "Does RL teach reasoning or just when to use it?", RL manages the timing of reasoning. The proactive thinking result extends this: RL also manages the mode of reasoning, redirecting extended thinking from unproductive self-doubt toward productive gap analysis.

The SFT finding adds nuance: when SFT data is self-generated by the model, it "does not inherently enhance its capabilities" and may reduce output entropy, constraining the subsequent RL phase. This echoes "Does policy entropy collapse limit reasoning performance in RL?": SFT-then-RL may face the same entropy collapse that pure RL faces, but through a different mechanism (entropy reduction from self-generated imitation rather than RL convergence).
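To make "reduced output entropy" concrete, here is a minimal sketch of Shannon entropy over a next-token distribution. It assumes hypothetical logits and uses a lower softmax temperature purely as a stand-in for a policy whose outputs have narrowed. The source does not claim that self-generated SFT acts like a temperature change, so treat this only as an illustration of what lower entropy means for the exploration left to a later RL phase.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats: H(p) = -sum(p_i * log(p_i))."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
broad = softmax(logits, temperature=1.0)
sharp = softmax(logits, temperature=0.25)  # assumed stand-in for a narrowed policy

# The sharper distribution concentrates mass on fewer tokens, so its entropy is lower.
print(entropy(broad) > entropy(sharp))
```

A policy with lower entropy samples fewer distinct continuations, which is why entropy reduction before RL can limit what the RL phase is able to explore.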

The practical implication: extended thinking is not a universal good. It is a resource that can be directed productively or destructively, and the direction depends on training. "More thinking" applied to a model without the right training signal may systematically make things worse.

Inquiring lines that read this note 125

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

What properties determine whether reward signals teach genuine reasoning?

How do training data properties shape reasoning capability development?

How does latent reasoning compare to verbalized chain-of-thought?

What capability tradeoffs emerge when scaling model reasoning abilities?

How can models identify insufficient information and respond appropriately without guessing?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Why must procedural skills consolidate before strategic reasoning can develop?

Do base models contain latent reasoning that training can unlock?

Why do correct reasoning traces tend to be shorter than incorrect ones?

How should inference compute be adaptively allocated based on prompt difficulty?

When do additional thinking tokens stop improving reasoning performance?

How can AI systems learn from failures without cascading errors?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Do corrupted reasoning traces serve as effective supervision signals?

Why do human-curated thought examples fail to improve model thinking?

Can model confidence signals reliably improve reasoning quality and calibration?

Can ensemble evaluation methods reduce bias more than single judges?

How does evaluation format change what we measure about model reasoning?

Why do reasoning models fail at systematic problem-solving and search?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

Are traditional cognitive theories missing interaction effects between mechanisms?

How does reasoning effort affect AI theory of mind performance?

Why do persona-level simulations fail to predict individual preferences accurately?

Do reasoning models become more vulnerable to persona-induced bias than standard models?

What actually drives chain-of-thought reasoning improvements in language models?

What three factors actually drive chain of thought performance improvements?

How does reasoning graph topology affect breakthrough insights and generalization?

What distinguishes redundant cycles from productive reconsidering cycles?

Why does self-revision increase model confidence while degrading accuracy?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

What distinguishes coherent reasoning from inaccurate but plausible predictions?

Does reinforcement learning teach reasoning or just when to reason?

Can AI systems balance emotional competence with factual reliability?

Do extended thinking blocks access latent empathetic capabilities in models?

Can prompting inject entirely new knowledge into language models?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can benchmark improvements hide degradation of deliberative reasoning?

How do self-generated feedback mechanisms enable effective model learning?

What training interventions could close the perception-action gap?

Do language models learn genuine linguistic structure or just surface patterns?

Why do thinking models execute longer tasks than standard language models?

What mechanisms drive sycophancy and how can we mitigate it?

Can reasoning training fix sycophancy if it is not a reasoning failure?

How should iterative research systems allocate reasoning per search step?

How does o1-style reasoning relate to learned search processes versus memorized solutions?

How can process reward models supervise complex reasoning traces?

Does process supervision recover reasoning accuracy better than outcome rewards in latent space?

Related concepts in this collection 4


Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
RL manages timing; this paper shows RL also manages quality direction of reasoning

Can models learn when to think versus respond quickly?
Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
DeGRPO mode selection; proactive thinking adds a training-mediated quality dimension

Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
SFT-then-RL may face entropy collapse through self-generated imitation

What critical thinking skills do reasoning models actually lose?
Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.
the thinking-mode reversal is a specific instance of the broader critical thinking problem: reasoning training optimizes one narrow type of thinking while degrading others; the proactive thinking result shows RL can selectively repair one form of degradation (self-doubt → gap analysis) while the critical thinking post documents the broader pattern


Original note title

rl training transforms thinking mode from counterproductive self-doubt into beneficial proactive analysis — the same mechanism helps or hurts depending on training
