SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Psychology, Society, and Alignment

Why does asking models to think first hurt performance?

Initial prompts to generate internal thoughts degrade instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?

Synthesis note · 2026-02-23 · sourced from Cognitive Models Latent

Thought Preference Optimization (TPO) reveals a counterintuitive dynamic in three stages:

Stage 1: Thinking hurts. An instruction-tuned model prompted to write internal thoughts before responding performs worse than the same model responding directly. This aligns with meta-analysis findings that chain-of-thought (CoT) prompting helps mainly on math and logic tasks. For general instruction following (creative writing, planning, understanding complex instructions), initial thoughts are not just unhelpful; they actively degrade performance. The instruction-tuned model has been heavily optimized for direct responses, and inserting unoptimized thoughts disrupts that optimization.

Stage 2: RL teaches useful thinking. Through iterative RLAIF training, the model learns to generate thoughts that actually improve responses. The key design: a standard judge model evaluates only the response and never sees the hidden thoughts. This forces the model to develop thoughts that produce better responses rather than thoughts that look good to an evaluator. Neither human-curated thoughts nor a specialized thought judge is required.
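
The response-only judging described in Stage 2 can be sketched as follows. This is a minimal illustration, not the TPO implementation: the `<response>` marker, the `Sample` type, the helper names, and the `judge(instruction, response)` signature are all hypothetical, and the downstream preference-optimization step that would consume the pairs is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical delimiter between the hidden thought and the visible response.
RESPONSE_MARKER = "<response>"


@dataclass
class Sample:
    thought: str   # hidden from the judge
    response: str  # the only text the judge scores


def split_output(raw: str) -> Sample:
    """Split a raw generation into thought and response at the marker."""
    thought, sep, response = raw.partition(RESPONSE_MARKER)
    if not sep:
        # No marker found: treat the whole output as a response with no thought.
        return Sample(thought="", response=raw.strip())
    return Sample(thought=thought.strip(), response=response.strip())


def build_preference_pair(
    instruction: str,
    raw_outputs: List[str],
    judge: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected) full outputs, ranked by response-only scores."""
    if len(raw_outputs) < 2:
        raise ValueError("need at least two sampled outputs to form a pair")
    scored = []
    for raw in raw_outputs:
        sample = split_output(raw)
        # The judge receives the instruction and the response, never the thought.
        scored.append((judge(instruction, sample.response), raw))
    scored.sort(key=lambda item: item[0])
    # Lowest-scored output is rejected, highest-scored output is chosen.
    return scored[-1][1], scored[0][1]
```

In this sketch the chosen and rejected items are the full outputs, thought included, so a thought is reinforced only through the quality of the response it led to. That is the indirect signal the note describes.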

Stage 3: Broad utility emerges. After training, thinking improves performance across general instruction-following tasks — not just math and logic. Internal thoughts serve planning (overall structure and characters for creative writing), instruction comprehension (parsing complex user requests), and strategy selection.

Two design principles matter. First, hiding the thoughts from the judge avoids the need for a thought-evaluation model — which would be inherently challenging since human thoughts are poorly documented and may not transfer to LLM thinking. Second, allowing thoughts to take "uninteresting" forms (making mistakes, drafting and evaluating responses, trying to understand the question) is essential — these forms would typically be pruned by a thought-evaluating judge but are precisely what makes thoughts useful.

This connects directly to the note "Does RL teach reasoning or just when to use it?": TPO provides concrete evidence that RL teaches when and how to deploy internal reasoning for different task types, not the reasoning capability itself. The capability was already present (the model could generate thoughts from the start). What was missing was the optimization signal for making thoughts serve responses.

The overthinking connection is also important. In relation to the note "Does more thinking time actually improve LLM reasoning?", TPO demonstrates a mechanism: unoptimized thinking actively hurts, and only RL-trained thinking helps. The quality of thinking matters more than its quantity.

Original note title

internal thought generation initially degrades performance until rl training adapts thoughts to serve responses — extending thinking beyond math and logic