INQUIRING LINE

Why does critique training produce deeper understanding than imitation training?

This explores why teaching a model to critique flawed answers builds deeper reasoning than teaching it to copy correct ones — and what 'deeper' actually means once you look at what each method transfers.


The corpus has a sharp answer: imitation mostly transfers surface, while critique forces engagement with the machinery of reasoning. When you train a model on correct answers, what it picks up is often style and output format, not understanding. Models imitating ChatGPT learn to sound confident and fluent without closing any real capability gap (Can imitating ChatGPT fool evaluators into thinking models improved?), and instruction tuning turns out to teach the *shape* of the output space rather than the task itself: models trained on semantically empty or even wrong instructions perform almost as well as those trained on correct ones (Does instruction tuning teach task understanding or output format?). Imitation is a weak teacher because there's an easy shortcut: mimic the surface and you get rewarded.

Critique removes that shortcut. To critique a noisy response you have to engage with *why* it fails (the structural reasoning, the failure modes), and that engagement is what builds genuine understanding. Strikingly, even imperfect critique supervision beats correct-answer imitation (Does critiquing errors teach deeper understanding than imitating correct answers?). The signal is so strong that critique fine-tuning on a *single* problem, using a teacher's critiques of varied solutions, can unlock reasoning comparable to full reinforcement learning: exposure to correct-versus-incorrect reasoning on one problem is a sufficient activation signal (Can a single problem unlock reasoning through solution critique?).
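To make the contrast concrete, here is a minimal sketch of how the two kinds of training example differ in shape. The field names (`prompt`, `target`), the prompt template, and both helper functions are assumptions for illustration; they are not the data format used in the cited work.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str  # what the model conditions on
    target: str  # what the training loss is computed against


def imitation_example(question: str, correct_answer: str) -> TrainingExample:
    """Imitation: the model is trained to reproduce a correct answer."""
    return TrainingExample(prompt=question, target=correct_answer)


def critique_example(question: str, noisy_response: str, critique: str) -> TrainingExample:
    """Critique: the model sees a possibly flawed response and must explain what is wrong."""
    # Hypothetical template; any format that pairs question and response would do.
    prompt = (
        f"Question: {question}\n"
        f"Candidate response: {noisy_response}\n"
        "Critique this response."
    )
    return TrainingExample(prompt=prompt, target=critique)


sft = imitation_example("What is 17 * 6?", "17 * 6 = 102")
cft = critique_example(
    "What is 17 * 6?",
    "17 * 6 = 17 * 5 + 6 = 91",
    "The decomposition is wrong: 17 * 6 = 17 * 5 + 17 = 102, not 17 * 5 + 6.",
)
```

In the imitation example the target can be matched by copying surface form. In the critique example the target only makes sense if the model locates the faulty step, which is the engagement the paragraph above describes.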

Here's the part you might not expect: the deeper benefit of critique shows up *during* training, not just at test time. Step-level critique in the training loop counteracts 'tail narrowing', the tendency of self-training to collapse onto a few solution patterns, and keeps the model's exploration diverse (Do critique models improve diversity during training itself?). Imitation pushes a model to converge prematurely on what it's already seen; critique keeps the search space open so the model can keep discovering.
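One simple way to see what collapsing onto a few solution patterns would look like is to track the entropy of the solution patterns a model samples across self-training iterations. The sketch below is an assumption-laden illustration: the pattern labels are made up, and Shannon entropy is just one possible diversity measure, not necessarily the one used in the cited work.

```python
import math
from collections import Counter


def pattern_entropy(patterns: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over solution patterns."""
    counts = Counter(patterns)
    total = len(patterns)
    # sum of p * log2(1/p) over observed patterns
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical pattern labels for sampled solutions at two points in training.
diverse = ["algebraic", "geometric", "case-split", "induction"]
narrowed = ["algebraic", "algebraic", "algebraic", "algebraic"]

print(pattern_entropy(diverse))   # four equally likely patterns
print(pattern_entropy(narrowed))  # a single pattern
```

A falling value across iterations would be one sign of narrowing; a training-time critic that keeps it from falling is doing the diversity-preserving work described above.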

This connects to a broader theme in the corpus about what reasoning training actually transfers. Models learn the *logical architecture* of reasoning (how steps sequence and connect) far more than factual content: in controlled ablations they lose only 3.2% accuracy when 50% of the numbers are corrupted, but 13.3% when the steps are shuffled (What do models actually learn from chain-of-thought training?). And training on messy *search processes*, including mistakes and backtracking, yields 25% higher accuracy than training only on clean optimal trajectories, because the model internalizes how to explore rather than a fixed path (Does training on messy search processes improve reasoning?). Critique belongs to this family: engaging with error and structure beats copying the polished final product.
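The two ablations contrasted above can be sketched as perturbations of a reasoning trace: one damages content while preserving order, the other preserves content while destroying order. The function names, the regex-based notion of "a number", and the replacement range are assumptions for illustration, not the procedure from the cited study.

```python
import random
import re


def corrupt_numbers(steps: list[str], fraction: float, rng: random.Random) -> list[str]:
    """Replace roughly `fraction` of the numbers with random values, keeping step order."""
    def maybe_corrupt(match: re.Match) -> str:
        if rng.random() < fraction:
            return str(rng.randint(0, 999))  # arbitrary replacement range
        return match.group(0)

    return [re.sub(r"\d+", maybe_corrupt, step) for step in steps]


def shuffle_steps(steps: list[str], rng: random.Random) -> list[str]:
    """Keep every step's content intact but randomize the order of steps."""
    shuffled = steps[:]
    rng.shuffle(shuffled)
    return shuffled


trace = [
    "Step 1: 12 apples are split into 3 bags.",
    "Step 2: each bag holds 12 / 3 = 4 apples.",
    "Step 3: two bags hold 2 * 4 = 8 apples.",
]
rng = random.Random(0)
content_damaged = corrupt_numbers(trace, fraction=0.5, rng=rng)
order_damaged = shuffle_steps(trace, rng=rng)
```

Training on `content_damaged` traces leaves the step-to-step structure available to learn from, while `order_damaged` traces remove it, which is the distinction the reported accuracy losses point to.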

The corpus doesn't frame this as a winner-take-all contest between critique and imitation. Sequencing matters: establishing reasoning foundations through imitation first, then sharpening against verifiable rewards, beats either alone, because the imitation phase creates reasonable rollouts that later exploration can refine (Does sequencing imitation then exploration training improve reasoning?). Imitation isn't useless; it's a weak teacher of understanding but a fine way to bootstrap a starting point. The deeper lesson: understanding comes from grappling with what's wrong and why, and any training that hands the model only correct answers quietly lets it skip that work.


Sources (8 notes)

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full correct instructions (43% versus a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can a single problem unlock reasoning through solution critique?

Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

What do models actually learn from chain-of-thought training?

Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning researcher evaluating whether the critique-vs.-imitation gap still holds. The question: does training models to critique flawed answers genuinely produce deeper understanding than training on correct answers alone?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. A library of arXiv papers on LLM reasoning and training claims:
• Imitation training captures surface style and output distribution, not task understanding; even semantically wrong instructions yield near-identical performance (2023).
• Critique training, even on noisy responses, outperforms correct-answer imitation in depth of reasoning (2025, arXiv:2501.17703).
• Single-problem critique fine-tuning unlocks reasoning comparable to full RL by exposing models to correct-vs.-incorrect contrasts (2025, arXiv:2506.03295).
• Critique supervision during training counters "tail narrowing" and preserves exploration diversity, whereas imitation causes premature convergence (2025).
• Training on messy search processes (including mistakes, backtracking) beats clean optimal trajectories; models learn logical *structure*, not just content (2024–2025).
• Curriculum sequencing (imitation → RL refinement) outperforms either method alone (2024, arXiv:2402.05808).

Anchor papers (verify; mind their dates):
• arXiv:2305.15717 (2023): The False Promise of Imitating Proprietary LLMs
• arXiv:2501.17703 (2025): Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
• arXiv:2506.03295 (2025): Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem
• arXiv:2508.01191 (2025): Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, assess whether newer models (e.g., o1, latest Gemini, Claude 4), scaling (∞ tokens), new training methods (DPO variants, outcome-based RL), or evals (harder benchmarks) have relaxed or overturned the critique advantage. Separate the durable claim (critique forces engagement with structure) from the perishable limitation (imitation transfers only style—does this still hold with larger, better-regularized models?). Cite what shifted it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Does any recent paper show imitation catches up, or that critique's advantage is dataset/scale-dependent?
(3) Propose 2 research questions that assume the regime has moved: e.g., does critique's value persist when imitation uses contrastive or process-level signals? Do very large models need critique at all?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
