Why does asking models to think first hurt performance?

Prompting a model to generate internal thoughts before it answers initially degrades instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?

Synthesis note · 2026-02-23 · sourced from Cognitive Models Latent

Thought Preference Optimization (TPO) reveals a counterintuitive dynamic in three stages:

Stage 1: Thinking hurts. An instruction-tuned model prompted to write internal thoughts before responding performs worse than the same model responding directly. This aligns with meta-analysis findings that CoT prompting helps only on math and logic tasks. For general instruction following (creative writing, planning, understanding complex instructions), initial thoughts are not merely unhelpful; they actively degrade performance. The instruction-tuned model has been heavily optimized for direct responses, and inserting unoptimized thoughts disrupts that optimization.
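The Stage 1 comparison can be pictured as two prompting modes over the same model. The following is a minimal sketch, not the prompt used in TPO: the prompt wording, the `<R>` marker, and the names `build_prompt` and `split_output` are all hypothetical, chosen only to show how a hidden thought could be separated from the visible response.

```python
from typing import Tuple

# Hypothetical marker separating the hidden thought from the visible response.
RESPONSE_MARKER = "<R>"

# Hypothetical thought prompt; the wording is an illustration, not TPO's prompt.
THOUGHT_PROMPT = (
    "Respond to the user query below. First write your internal thoughts, "
    "then write your final response after the marker '" + RESPONSE_MARKER + "'.\n\n"
    "User query: {query}"
)


def build_prompt(query: str, think_first: bool) -> str:
    """Return the direct prompt or the think-first prompt for the same query."""
    if think_first:
        return THOUGHT_PROMPT.format(query=query)
    return query


def split_output(output: str) -> Tuple[str, str]:
    """Split raw model output into (thought, response).

    If the marker is missing, the whole output is treated as the response
    and the thought is empty.
    """
    thought, sep, response = output.partition(RESPONSE_MARKER)
    if not sep:
        return "", output.strip()
    return thought.strip(), response.strip()
```

Only the second element returned by `split_output` would be shown to a user or an evaluator, so the two modes can be compared on the response alone.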

Stage 2: RL teaches useful thinking. Through iterative RLAIF training, the model learns to generate thoughts that actually improve responses. The key design: a standard judge model evaluates only the response and never sees the hidden thoughts. This forces the model to develop thoughts that produce better responses rather than thoughts that look good to an evaluator. Neither human-curated thoughts nor a specialized thought judge is required.
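The response-only judging described above can be sketched as follows. This is an illustration under stated assumptions, not the TPO implementation: the `Sample` type, `build_preference_pair`, the `<R>` marker, and the best-versus-worst pairing rule are hypothetical, and the judge is a stand-in callable rather than a real judge model. The point it shows is that the judge receives only the response, while the training pair carries the thought along with it.

```python
from typing import Callable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    """One sampled output, already split into hidden thought and response."""
    thought: str
    response: str


# A judge scores (query, response). It has no parameter for the thought,
# so it cannot see it. Any scoring callable can stand in for a judge model.
Judge = Callable[[str, str], float]


def build_preference_pair(
    query: str, samples: List[Sample], judge: Judge
) -> Tuple[str, str]:
    """Return (chosen, rejected) full outputs for preference training.

    Assumption: the highest- and lowest-scored samples form the pair.
    Scores come from the response alone, but each side of the pair keeps
    its thought, so the thought is optimized indirectly.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to form a pair")
    scored = [(judge(query, s.response), i) for i, s in enumerate(samples)]
    best = samples[max(scored)[1]]
    worst = samples[min(scored)[1]]

    def render(s: Sample) -> str:
        # Hypothetical serialization: thought, marker, then response.
        return f"{s.thought}\n<R>\n{s.response}"

    return render(best), render(worst)
```

Because the thought never reaches the judge, a messy thought that leads to a strong response ends up on the chosen side, while a polished thought that leads to a weak response ends up on the rejected side.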

Stage 3: Broad utility emerges. After training, thinking improves performance across general instruction-following tasks, not just math and logic. Internal thoughts serve planning (for example, the overall structure and characters of a piece of creative writing), instruction comprehension (parsing complex user requests), and strategy selection.

Two design principles matter. First, hiding the thoughts from the judge avoids the need for a thought-evaluation model. Building one would be inherently challenging, since human thought processes are poorly documented and may not transfer to LLM thinking. Second, allowing thoughts to take "uninteresting" forms (making mistakes, drafting and evaluating responses, trying to understand the question) is essential. A thought-evaluating judge would typically prune these forms, yet they are precisely what makes thoughts useful.

This connects directly to the note "Does RL teach reasoning or just when to use it?": TPO provides concrete evidence that RL teaches when and how to deploy internal reasoning for different task types, not the reasoning capability itself. The capability was already present (the model could generate thoughts from the start). What was missing was the optimization signal for making thoughts serve responses.

The overthinking connection is also important. The note "Does more thinking time actually improve LLM reasoning?" asks whether extended thinking helps, and TPO suggests a mechanism for why it can fail: unoptimized thinking actively hurts, and only RL-trained thinking helps. The quality of thinking matters more than its quantity.

Inquiring lines that read this note 4

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How should iterative research systems allocate reasoning per search step?

How does overthinking in early turns degrade later retrieval rounds?

What properties determine whether reward signals teach genuine reasoning?

What role does task structure play in rewarding delayed thinking?

Do base models contain latent reasoning that training can unlock?

Can we predict when a model will develop thinking behaviors?

When do additional thinking tokens stop improving reasoning performance?

When does extended thinking hurt performance on easier problems?

Related concepts in this collection 5

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
TPO provides direct evidence: RL teaches deployment of thinking, not thinking itself
Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
TPO shows unoptimized thinking hurts; explains WHY more thinking can degrade performance
When does explicit reasoning actually help model performance? Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
TPO overcomes this split: RL-trained thinking helps across task types
Does thinking emerge when agents choose between learned sub-policies? Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
related mechanism: thinking emerges when RL provides selection pressure
Can models learn when to think versus respond quickly? Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
TPO is a concrete implementation of this principle

Original note title

internal thought generation initially degrades performance until rl training adapts thoughts to serve responses — extending thinking beyond math and logic
