SYNTHESIS NOTE


Does critiquing errors teach deeper understanding than imitating correct answers?

Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

Supervised Fine-Tuning (SFT) trains models to maximize the probability of a correct response given an instruction, P(response | query). Critique Fine-Tuning (CFT) trains models to maximize the probability of a high-quality critique given an instruction plus a noisy (flawed) response, P(critique | query, flawed_response). At inference time, the CFT-trained model generates direct responses in the normal way; no critique step is invoked.
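The difference between the two objectives comes down to what is placed in the conditioning context and what is scored. The sketch below is a minimal illustration, not the original CFT implementation: the prompt templates, the TrainingExample class, and the whitespace "tokenizer" are all assumptions made for readability.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str  # conditioning text; no loss is computed on these tokens
    target: str  # text whose likelihood the objective maximizes


def make_sft_example(query: str, correct_response: str) -> TrainingExample:
    """SFT: maximize P(response | query)."""
    return TrainingExample(
        prompt=f"Question: {query}\nAnswer:",
        target=correct_response,
    )


def make_cft_example(query: str, flawed_response: str, critique: str) -> TrainingExample:
    """CFT: maximize P(critique | query, flawed_response)."""
    return TrainingExample(
        prompt=f"Question: {query}\nCandidate answer: {flawed_response}\nCritique:",
        target=critique,
    )


def make_inference_prompt(query: str) -> str:
    """At inference the model sees only the query: no candidate answer, no critique."""
    return f"Question: {query}\nAnswer:"


def loss_mask(example: TrainingExample) -> list:
    """Return 0 for prompt tokens and 1 for target tokens.

    Whitespace splitting stands in for a real tokenizer (assumption).
    """
    n_prompt = len(example.prompt.split())
    n_target = len(example.target.split())
    return [0] * n_prompt + [1] * n_target


if __name__ == "__main__":
    cft = make_cft_example(
        "What is 2+3?",
        "6",
        "The sum is wrong: 2+3=5.",
    )
    print(cft.prompt)
    print(loss_mask(cft))
```

In this framing the flawed response appears only in the prompt, so the model is never trained to produce it; only the critique tokens contribute to the loss.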

The advantage is mechanistic: to write a good critique, the model must understand the problem at a structural level — not just recognize the correct answer pattern but identify precisely what is wrong with a given response and why. This requires engaging with failure modes, understanding the criteria for correctness, and reasoning about deviations from those criteria. SFT can succeed by learning to recognize the surface form of correct answers. CFT cannot succeed by surface matching alone.

The training data is cheap to generate: GPT-4o produces critiques for (query, noisy response) pairs at scale. The cost is that at least 20% of critiques contain errors (an acknowledged limitation). But even imperfect critique supervision outperforms correct-response imitation, which reveals how weak the imitation objective is at building understanding.
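The data-generation step described above can be sketched as a loop over (query, noisy response) pairs with a pluggable critic. This is a hedged illustration only: build_cft_dataset, the Critic type, and toy_critic are hypothetical names, the critic is a stand-in for whatever teacher model produces critiques, and skipping empty critiques is an assumption rather than something the note specifies.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A critic maps (query, noisy_response) to critique text.
# In practice this would wrap a call to a teacher model (assumption).
Critic = Callable[[str, str], str]


def build_cft_dataset(
    pairs: Iterable[Tuple[str, str]],
    critic: Critic,
) -> List[Dict[str, str]]:
    """Attach a critique to every (query, noisy_response) pair.

    Critiques are NOT verified here, so some fraction may be wrong,
    which mirrors the noise in critique supervision noted above.
    """
    dataset = []
    for query, noisy_response in pairs:
        critique = critic(query, noisy_response)
        if not critique.strip():
            continue  # assumption: drop pairs where the critic returned nothing
        dataset.append(
            {"query": query, "response": noisy_response, "critique": critique}
        )
    return dataset


def toy_critic(query: str, response: str) -> str:
    """Placeholder critic used only to make the sketch runnable."""
    return f"Check the response '{response}' step by step against the question."


if __name__ == "__main__":
    pairs = [("What is 2+3?", "6"), ("What is 4*4?", "16")]
    for row in build_cft_dataset(pairs, toy_critic):
        print(row)
```

Keeping the critic as a plain callable means the same loop works with any teacher, and a verification or filtering pass could be added later without changing the dataset format.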

The key limitation is illuminating: CFT-trained models can critique other models' outputs but do not develop self-critique capability. The training objective creates a competence asymmetry: the model gets better at evaluating others, not at evaluating itself. This is consistent with the note "Why do models trust their own generated answers?": the structural self-trust bias persists even after extensive critique training on others' outputs.

This connects to the note "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": both identify the same SFT failure mode, imitation of surface form without structural understanding. CFT addresses the root: instead of training on correct form, train on structured failure analysis.

Inquiring lines that read this note 32

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does self-reflection enable models to reliably correct their errors?

Can ensemble evaluation methods reduce bias more than single judges?

How does execution-guided critique differ from abstract action evaluation?

How do training data properties shape reasoning capability development?

How do training priors constrain what context information can override?

What factors beyond surface content determine how readers extract meaning differently?

What distinguishes genuine understanding from correct output without coherent principles?

What properties determine whether reward signals teach genuine reasoning?

What happens when confident wrong answers become more rewarded than uncertain correct ones?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

What are collider structures and why do they reveal reasoning errors?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can high test performance mask a complete absence of understanding?

How do social dynamics and selection effects compound in rating aggregates?

How do self-generated feedback mechanisms enable effective model learning?

How should training incorporate external critique versus encouraging self-correction?

Why do readers trust citations and complexity regardless of accuracy?

Why does polished presentation substitute for deeper expert judgment?

Why does self-revision increase model confidence while degrading accuracy?

How do adversarial and manipulative prompts attack reasoning models?

How can AI systems learn from failures without cascading errors?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Does critique training improve exploration diversity during model training or only test time?

Can model confidence signals reliably improve reasoning quality and calibration?

What makes mathematically confident but incorrect answers resemble valid solution shapes?

Related concepts in this collection 7

Does chain-of-thought reasoning reveal genuine inference or pattern matching?
Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
SFT imitation is the failure; CFT is an alternative training objective that forces structural understanding over form imitation

Why do models trust their own generated answers?
Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.
CFT's self-critique limitation confirms structural self-trust bias persists even when critique competence is developed for other-model evaluation

Does supervised fine-tuning improve reasoning or just answers?
Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
CFT is the counter-strategy: instead of training on correct answer form (which raises scores without understanding), CFT trains on structured failure analysis (which requires understanding)

Do critique models improve diversity during training itself?
Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
complementary critique mechanism: AutoMathCritique uses critique to improve training-time exploration diversity; CFT uses critique-writing as the training signal itself; both treat critique as more than test-time quality filter

Can adversarial critics replace task-specific verifiers for reasoning?
Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
parallel mechanism: RARO's adversarial critic forces genuine reasoning for the same reason CFT's critique objective does: discriminating expert from policy requires structural understanding, not surface pattern matching; both bypass pure imitation

Can reasoning emerge from expert demonstrations alone?
Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.
RARO's co-trained critic operationalizes the critique principle via adversarial RL: the critic component develops evaluation capability through the same structural-understanding mechanism that makes CFT work, but in a joint training loop rather than a separate training objective

Can reasoning improvement work without answer verification?
Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.
VeriFree extends critique-based training to domains without verifiers: where CFT trains on structured critique of flawed responses, VeriFree conditions on reference answer likelihood to create reward signal without explicit verification; both bypass the requirement for deterministic answer checking that limits standard RL to math/code

Original note title

training to critique noisy responses produces deeper understanding than training to imitate correct responses
