INQUIRING LINE

Why do human-curated thought examples fail to improve model thinking?

This explores why feeding models clean, human-written examples of 'good thinking' (curated reasoning traces, labeled exemplars, polished solutions) often doesn't make them reason better — and what the corpus says actually does.


This reads the question as: when we hand a model tidy, human-curated examples of good reasoning and train on them, why doesn't the model's actual thinking improve? The corpus has a surprisingly sharp answer: clean examples teach the *look* of reasoning, not the *engine* of it. Models trained to imitate confident, fluent reasoning mostly learn surface style. "Can imitating ChatGPT fool evaluators into thinking models improved?" shows that imitation models fool human evaluators by mimicking the tone of a stronger model while closing no real capability gap. "Do reasoning traces show how models actually think?" pushes this further: reasoning traces themselves are persuasive appearances, where logically invalid steps perform almost as well as valid ones, so curating examples for their *correct appearance* optimizes the wrong thing.

The deeper problem is that polished examples strip out exactly what's useful. "Does training on messy search processes improve reasoning?" found that training on the messy search process (wrong turns, dead ends, backtracking) yields 25% higher accuracy than training on clean optimal trajectories alone. The mistakes are the lesson: they teach the model an internal model of *how to search*, which a curated 'here's the right answer' example deletes. Human-curated thought is curated precisely to hide the wandering, and the wandering is where the skill lives.
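To make the contrast concrete, here is a minimal sketch of what a "messy" training string and a "clean" one might look like for the same problem. It is an illustration only, not the serialization used in the Stream of Search work: the toy arithmetic puzzle, the trace wording, and the names `search` and `build_training_strings` are all assumptions introduced here.

```python
from itertools import combinations


def search(nums, target, trace, path):
    """Depth-first search that logs every attempt, dead end, and backtrack.

    `trace` collects the full exploration; `path` holds only the steps on the
    branch currently being explored, so on success it is the clean solution.
    """
    if target in nums:
        trace.append(f"goal {target} reached")
        return True
    if len(nums) > 1:
        for i, j in combinations(range(len(nums)), 2):
            a, b = nums[i], nums[j]
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            hi, lo = max(a, b), min(a, b)
            candidates = [(f"{a}+{b}", a + b), (f"{a}*{b}", a * b), (f"{hi}-{lo}", hi - lo)]
            for expr, value in candidates:
                step = f"{expr}={value}"
                trace.append(f"try {step}")
                path.append(step)
                if search(rest + [value], target, trace, path):
                    return True
                path.pop()  # undo the step; it stays visible in `trace`
    trace.append(f"dead end at {nums}, backtrack")
    return False


def build_training_strings(nums, target):
    """Return (messy, clean): the full search log and the solution-only steps."""
    trace, path = [], []
    if not search(list(nums), target, trace, path):
        raise ValueError("no solution found for this puzzle")
    return "\n".join(trace), "\n".join(path)


# Hypothetical puzzle: combine 2, 3, 5 with +, -, * to reach 13.
messy, clean = build_training_strings([2, 3, 5], 13)
print(clean)
print(f"{len(messy.splitlines())} trace lines vs {len(clean.splitlines())} solution lines")
```

The clean string keeps only the steps that worked, while the messy string also records each failed attempt and backtrack. That extra material is what a curated, solution-only example throws away.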

There's also a transfer failure. "Can models learn argument quality from labeled examples alone?" shows that fine-tuning on labeled quality examples teaches models surface patterns rather than the principle behind them, so the learned criteria do not transfer to new argument types. What worked instead was explicit instruction with theoretical frameworks such as RATIO or QOAM (naming the criteria directly) rather than hoping the model would induce them from examples. Examples under-specify the rule; the model latches onto whatever shortcut fits the sample.

And there may be nothing to install in the first place. "Do base models already contain hidden reasoning ability?" argues that post-training *selects* reasoning already latent in the base model rather than creating it: five independent methods all elicit the same buried capability. "Does extended thinking help or hurt model reasoning?" complements this: the same 'thinking' mechanism hurts in vanilla models, where it induces self-doubt, and helps once RL training redirects it into productive analysis. If the bottleneck is elicitation and redirection, then mimicking curated examples is the wrong tool, because it copies outputs instead of steering the latent process.

The unexpected takeaway: better model thinking seems to come from showing the struggle, naming the principle, or unlocking what's already there, not from showing the polished result. If you want to go deeper on the 'examples teach style not substance' thread, start with "Do reasoning traces show how models actually think?"; for the 'mess is the lesson' thread, start with "Does training on messy search processes improve reasoning?".


Sources (6 notes)

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether human-curated reasoning examples still fail to improve model thinking, or whether recent advances have shifted the constraint.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable:
• Imitation of polished reasoning teaches *style* not capability — models fool evaluators while closing no real gap (2023).
• Training on messy search processes — wrong turns, backtracking — outperforms clean optimal trajectories by ~25% (2024).
• Fine-tuning on quality examples produces surface-pattern learning that fails to generalize; explicit theoretical frameworks work better (2024).
• Reasoning capability is latent in base models; post-training *selects* it via five independent methods that elicit the same buried process (2025).
• RL can redirect the same reasoning mechanism from self-doubt into productive analysis (2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.15717 — The False Promise of Imitating Proprietary LLMs (2023)
• arXiv:2404.03683 — Stream of Search: Learning to Search in Language (2024)
• arXiv:2505.20296 — Reasoning LLMs are Wandering Solution Explorers (2025)
• arXiv:2604.15726 — LLM Reasoning Is Latent, Not the Chain of Thought (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, ask: do newer models (o1, o3, frontier models post-June 2026), improved training recipes (constitutional AI, outcome-based RL, multi-objective tuning), or better evals reveal that curated examples NOW improve thinking? Where does the style/substance gap still hold? Cite what resolves or reinforces it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper showing curated examples DO transfer, or showing a regime shift where polishing helps.
(3) Propose 2 research questions that assume the bottleneck may have moved: e.g., from 'examples don't teach process' to 'curated examples need interactive scaffolding' or 'examples fail without parallel latent-unlock training'.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
