SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

Does critiquing errors teach deeper understanding than imitating correct answers?

Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

Supervised Fine-Tuning (SFT) trains models to maximize the probability of a correct response given an instruction. Critique Fine-Tuning (CFT) trains models to maximize the probability of a high-quality critique given an instruction plus a noisy (flawed) response. Formally, SFT maximizes P(response | query), while CFT maximizes P(critique | query, flawed_response). At inference time, the CFT-trained model answers queries directly, as usual; no critique step is run.
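
The difference between the two objectives is easiest to see in how a single training example is laid out. The sketch below is a minimal, framework-free illustration and not the authors' training code: the prompt template, the argument names (`query`, `correct_response`, `noisy_response`, `critique`), and the helper names are all assumptions made for illustration. In both cases the loss is the negative log-likelihood of the target tokens only, with the conditioning text masked out.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """One training example: the model conditions on `prompt` and is scored on `target`."""
    prompt: str
    target: str


def make_sft_example(query: str, correct_response: str) -> Example:
    # SFT: maximize P(correct_response | query).
    return Example(prompt=f"Question: {query}\nAnswer:", target=correct_response)


def make_cft_example(query: str, noisy_response: str, critique: str) -> Example:
    # CFT: maximize P(critique | query, noisy_response).
    # The flawed response is part of the conditioning context, never the target.
    prompt = (
        f"Question: {query}\n"
        f"Candidate answer: {noisy_response}\n"
        "Critique:"
    )
    return Example(prompt=prompt, target=critique)


def target_nll(token_logprobs: list[float], is_target: list[bool]) -> float:
    """Mean negative log-likelihood over target positions only.

    `token_logprobs[i]` is the model's log-probability of token i;
    `is_target[i]` is False for prompt tokens, which are excluded from the loss.
    """
    picked = [lp for lp, keep in zip(token_logprobs, is_target) if keep]
    if not picked:
        raise ValueError("no target tokens to score")
    return -sum(picked) / len(picked)


# Hypothetical toy data, for illustration only.
sft = make_sft_example("What is 7 * 8?", "56")
cft = make_cft_example(
    "What is 7 * 8?", "54", "7 * 8 is 56, so the candidate answer is off by 2."
)

# Toy log-probabilities for three tokens: one prompt token, two target tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, True, True]
loss = target_nll(logprobs, mask)  # averages only the two target positions
```

The only thing that changes between the two setups is what sits in the prompt and what sits in the target; the loss function itself is the same.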

The advantage is mechanistic: to write a good critique, the model must understand the problem at a structural level — not just recognize the correct answer pattern but identify precisely what is wrong with a given response and why. This requires engaging with failure modes, understanding the criteria for correctness, and reasoning about deviations from those criteria. SFT can succeed by learning to recognize the surface form of correct answers. CFT cannot succeed by surface matching alone.

The training data is cheap to generate: GPT-4o produces critiques for pairs of a query and a noisy response at scale. The cost is that at least 20% of critiques contain errors, a limitation the authors acknowledge. But even imperfect critique supervision outperforms correct-response imitation, which reveals how weak the imitation objective is at building understanding.
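
The data pipeline described above can be sketched as a loop over query and noisy-response pairs, with the critique supplied by a teacher model. In this sketch the teacher is an arbitrary callable (`critic`); the stand-in `toy_critic`, the record fields, and `build_cft_dataset` are hypothetical names for illustration, and no real model API is called. Nothing here checks whether a critique is correct, which mirrors the note's point that the supervision is used even though it is imperfect.

```python
from typing import Callable, Iterable

# A critic maps (query, noisy_response) -> critique text.
# In the note's setup this role is played by GPT-4o; here it is any callable.
Critic = Callable[[str, str], str]


def build_cft_dataset(pairs: Iterable[tuple[str, str]], critic: Critic) -> list[dict]:
    """Turn (query, noisy_response) pairs into CFT training records."""
    records = []
    for query, noisy_response in pairs:
        critique = critic(query, noisy_response)
        if not critique.strip():
            continue  # skip empty critiques; no correctness filtering is applied
        records.append(
            {
                "input": (
                    f"Question: {query}\n"
                    f"Candidate answer: {noisy_response}\n"
                    "Critique:"
                ),
                "target": critique,
            }
        )
    return records


def toy_critic(query: str, noisy_response: str) -> str:
    # Hypothetical stand-in for a teacher model, for demonstration only.
    return f"Check each step of '{noisy_response}' against the question '{query}'."


dataset = build_cft_dataset([("What is 7 * 8?", "54")], toy_critic)
```

Keeping the critic behind a plain callable makes it easy to swap in a different teacher or to add a verification pass later without touching the record format.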

The key limitation is illuminating: CFT-trained models can critique other models' outputs but do not develop self-critique capability. The training objective creates a competence asymmetry: the model gets better at evaluating others, not at evaluating itself. This is consistent with the note "Why do models trust their own generated answers?": the structural bias toward self-trust persists even after extensive critique training on others' outputs.

This connects to the note "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": both identify the same SFT failure mode, learning the surface form of correct answers without the structure behind them. CFT targets the root cause: instead of training on correct form, train on structured failure analysis.

Original note title

training to critique noisy responses produces deeper understanding than training to imitate correct responses