How does training on correct answer form differ mechanistically from training on failure analysis?
This explores the mechanistic split between two training signals: teaching a model what a correct answer looks like (imitating right outputs) versus teaching it to engage with what goes wrong (critiquing, filtering, or verifying failures) — and why those two produce different kinds of learning.
The corpus is surprisingly consistent that these two signals are not two routes to the same destination. Training on correct answer form tends to teach surface structure, not understanding. When a model imitates right answers, it learns the *shape* of a good output and where the answer space lives, but not the reasoning that earns it. The cleanest evidence is that instruction tuning on semantically empty or deliberately wrong instructions performs comparably to tuning on full, correct instructions: what transfers is the output format distribution, not task comprehension (see "Does instruction tuning teach task understanding or output format?"). The same pattern shows up in chain-of-thought: logically *invalid* reasoning exemplars perform nearly as well as valid ones, because the model is absorbing the form of reasoning rather than doing inference (see "Does logical validity actually drive chain-of-thought gains?" and "Why does chain-of-thought reasoning fail in predictable ways?").
Training on failure analysis works differently at the mechanistic level because it forces the model to *engage with the gap* between an attempt and a correct outcome, and that gap is where structure lives. Critique fine-tuning, where the model is trained to find what's wrong with noisy responses, produces deeper understanding than imitating correct answers, and it does so even when the critique supervision is itself imperfect (see "Does critiquing errors teach deeper understanding than imitating correct answers?"). The reason is that a critique can't be satisfied by pattern-matching a familiar output shape; it has to locate a specific defect, which pushes learning toward reasoning rather than recall.
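To make the contrast concrete, here is a minimal sketch of how a training example might be laid out under each signal. The function names, the prompt template, and the toy arithmetic item are all hypothetical and chosen for illustration; they are not taken from the cited work.

```python
def build_imitation_example(question, reference_answer):
    """Correct-form signal: the training target is the right answer itself."""
    return {"input": question, "target": reference_answer}


def build_critique_example(question, noisy_response, critique):
    """Failure-analysis signal: the target is a critique of a given attempt.

    The prompt template below is an assumption made for this sketch.
    """
    prompt = (
        f"Question: {question}\n"
        f"Candidate response: {noisy_response}\n"
        "Critique the response."
    )
    return {"input": prompt, "target": critique}


question = "What is 17 * 6?"

imitation = build_imitation_example(question, "102")

critique_ex = build_critique_example(
    question,
    noisy_response="17 * 6 = 17 * 5 + 6 = 91",
    critique="The decomposition adds 6 instead of 17; 17 * 5 + 17 = 102.",
)
```

In the imitation example the model only ever has to emit "102". In the critique example the flawed attempt is part of the input, so the target cannot be produced without locating the specific wrong step.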
The distinction sharpens once you look at where errors actually occur. Scoring final answers misses most of what goes wrong, because the majority of failures are *process* violations: wrong intermediate states, not wrong conclusions. Adding verification of the reasoning process raised task success from 32% to 87%, which means the correct-answer signal was blind to the bulk of the failures (see "Where do reasoning agents actually fail during long traces?"). Failure analysis sees those; correct-form imitation can't.
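A toy illustration of that blind spot, assuming a trace is a list of arithmetic steps of the form `(a, op, b, claimed_result)`. The trace format and both function names are hypothetical. The trace is built so that two wrong steps cancel out and the final answer still matches.

```python
import operator

# Supported operations for the toy trace format (an assumption of this sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_score(trace, expected):
    """Final-answer check: looks only at the last claimed value."""
    return trace[-1][3] == expected


def process_violations(trace):
    """Return the indices of steps whose claimed result is wrong."""
    return [
        i
        for i, (a, op, b, claimed) in enumerate(trace)
        if OPS[op](a, b) != claimed
    ]


# Task: compute 2 * 3 + 4 (expected 10).
# Step 0 claims 2 * 3 = 5 and step 1 claims 5 + 4 = 10; the two errors cancel.
trace = [(2, "*", 3, 5), (5, "+", 4, 10)]

final_ok = outcome_score(trace, expected=10)   # True
bad_steps = process_violations(trace)          # [0, 1]
```

The outcome check accepts the trace while the process check flags both steps, which is the kind of failure a final-answer signal never sees.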
A subtle wrinkle: failures aren't only useful as *targets to critique*; they're also useful as *signal to keep around*. Asymmetric trajectory filtering keeps only high-quality positive examples but deliberately preserves diverse failures as negative signal, and that mix let a 14B model reach frontier math performance. Stripping the failures out would have removed exactly the contrast the model learns from (see "Why do correct code trajectories teach models to tolerate errors?"). This connects to why models are bad at self-correction in the first place: they structurally over-trust their own generated answers, so a training regime built only on what *looks* correct reinforces that bias, while one built on engaging failure breaks the self-agreement loop (see "Why do models trust their own generated answers?").
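The asymmetry can be sketched roughly as follows. This is a simplified illustration, not the implementation of the method described in the source notes: the `Trajectory` fields, the use of `tool_errors` as the quality proxy, and the sort-then-truncate rule for positives are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    tid: int
    correct: bool
    tool_errors: int  # hypothetical quality proxy: fewer is cleaner


def asymmetric_filter(trajs, n_pos, n_neg, seed=0):
    """Filter positives strictly by quality; sample negatives without filtering."""
    rng = random.Random(seed)
    positives = [t for t in trajs if t.correct]
    negatives = [t for t in trajs if not t.correct]
    # Positives: keep only the cleanest n_pos trajectories.
    kept_pos = sorted(positives, key=lambda t: t.tool_errors)[:n_pos]
    # Negatives: uniform sample, so varied failure modes survive.
    kept_neg = rng.sample(negatives, min(n_neg, len(negatives)))
    return kept_pos, kept_neg


rollouts = [
    Trajectory(0, True, 0), Trajectory(1, True, 3), Trajectory(2, True, 1),
    Trajectory(3, False, 0), Trajectory(4, False, 2), Trajectory(5, False, 5),
]

kept_pos, kept_neg = asymmetric_filter(rollouts, n_pos=2, n_neg=3)
```

Here the correct-but-messy trajectory (three tool errors) is dropped, while every failure is retained regardless of how messy it is. That is the design choice the paragraph describes: quality pressure is applied only to the examples the model is pushed toward.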
The thing you might not expect: optimizing hard for correct answers can actively *erase* good reasoning behavior. Post-training that faithfully drives answer correctness simultaneously suppresses unmeasured traits like epistemic hedging and uncertainty-aware reasoning. The single objective creates blind spots where the stylistic features that help generalization go unprotected (see "Can post-training objectives preserve reasoning style alongside correctness?"). And because format shapes strategy far more than content does (multiple-choice training induces breadth-first search, free-form induces depth-first; see "Does training data format shape reasoning strategy more than domain?"), *how* you present correct answers quietly determines how the model thinks, which is a lever failure-analysis training engages with directly rather than by accident.
Sources (9 notes)
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full correct instructions, reaching 43% versus a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Research shows that post-training objectives faithfully guide models toward correct answers yet simultaneously suppress unmeasured behaviors like epistemic verbalization. Single-objective optimization creates blind spots where stylistic features critical to generalization are unprotected.
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.