INQUIRING LINE

How does training on correct answer form differ mechanistically from training on failure analysis?

This explores the mechanistic split between two training signals: teaching a model what a correct answer looks like (imitating right outputs) versus teaching it to engage with what goes wrong (critiquing, filtering, or verifying failures) — and why those two produce different kinds of learning.


The corpus is surprisingly consistent that these two signals are not two routes to the same destination. Training on correct answer form tends to teach surface structure, not understanding. When you imitate right answers, the model learns the *shape* of a good output and where the answer space lives, but not the reasoning that earns it. The cleanest evidence is that instruction tuning on semantically empty or deliberately wrong instructions still matches full correct-instruction training: what transfers is the output format distribution, not task comprehension (Does instruction tuning teach task understanding or output format?). The same pattern shows up in chain-of-thought: logically *invalid* reasoning exemplars perform nearly as well as valid ones, because the model is absorbing the form of reasoning rather than doing inference (Does logical validity actually drive chain-of-thought gains?; Why does chain-of-thought reasoning fail in predictable ways?).

Training on failure analysis works differently at the mechanistic level because it forces the model to *engage with the gap* between an attempt and a correct outcome, and that gap is where structure lives. Critique fine-tuning, where the model is trained to find what's wrong with noisy responses, produces deeper understanding than imitating correct answers, and it does so even when the critique supervision is itself imperfect (Does critiquing errors teach deeper understanding than imitating correct answers?). The reason is that a critique can't be satisfied by pattern-matching a familiar output shape; it has to locate a specific defect, which pushes learning toward reasoning rather than recall.
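
To make the contrast concrete, here is a minimal sketch of how the two kinds of training example might be assembled. It is an illustration under stated assumptions, not the data format of any specific critique fine-tuning paper: `TrainingExample`, `imitation_example`, `critique_example`, and the prompt template are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str  # text the model conditions on
    target: str  # text the loss is computed against


def imitation_example(question: str, correct_answer: str) -> TrainingExample:
    """Correct-form signal: the target is simply the right output."""
    return TrainingExample(prompt=question, target=correct_answer)


def critique_example(question: str, noisy_response: str, critique: str) -> TrainingExample:
    """Failure-analysis signal: the flawed attempt is part of the input,
    and the target must name what is wrong with it."""
    # Hypothetical prompt template, chosen only for illustration.
    prompt = (
        f"Question: {question}\n"
        f"Candidate response: {noisy_response}\n"
        "Critique the candidate response."
    )
    return TrainingExample(prompt=prompt, target=critique)


question = "What is 17 * 6?"
imitation = imitation_example(question, "102")
critique = critique_example(
    question,
    noisy_response="17 * 6 = 17 * 5 + 6 = 91",
    critique="The last term should be 17, not 6: 17 * 5 + 17 = 102.",
)
```

In the imitation example the target can be produced by matching the expected output shape. In the critique example the target depends on the specific defect in the candidate response, so the same target cannot be reused across different flawed attempts.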

The distinction sharpens once you look at where errors actually occur. Scoring final answers misses most of what goes wrong, because the majority of failures are *process* violations: wrong intermediate states, not wrong conclusions. Adding verification of the reasoning process raised task success from 32% to 87%, which means the correct-answer signal was blind to the bulk of the failures (Where do reasoning agents actually fail during long traces?). Failure analysis sees those; correct-form imitation can't.
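
A toy sketch of why an outcome check and a process check can disagree. Everything here is assumed for illustration: the trace is a made-up sequence of account balances, the "balance must never go negative" rule is a hypothetical policy, and `outcome_score` and `first_process_violation` are invented helper names, not the verifier from the cited work.

```python
from typing import Callable, Optional, Sequence


def outcome_score(final_answer: int, expected: int) -> bool:
    """Outcome-only signal: looks at the last state and nothing else."""
    return final_answer == expected


def first_process_violation(
    states: Sequence[int], is_valid: Callable[[int], bool]
) -> Optional[int]:
    """Process signal: return the index of the first invalid intermediate
    state, or None if every state passes the check."""
    for index, state in enumerate(states):
        if not is_valid(state):
            return index
    return None


# Hypothetical trace of running balances; the policy (assumed for this
# example) is that the balance must never go negative.
trace = [100, 40, -20, 30]
expected_final = 30

outcome_ok = outcome_score(trace[-1], expected_final)
violation_at = first_process_violation(trace, lambda balance: balance >= 0)
```

Here the final answer matches, so the outcome check passes, while the process check flags the third state. A training signal built only on the outcome check would treat this trace as a success.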

A subtle wrinkle: failures aren't only useful as *targets to critique*; they're also useful as *signal to keep around*. Asymmetric trajectory filtering keeps only high-quality positive examples but deliberately preserves diverse failures as negative signal, and that mix let a 14B model reach frontier math performance. Stripping the failures out would have removed exactly the contrast the model learns from (Why do correct code trajectories teach models to tolerate errors?). This connects to why models are bad at self-correction in the first place: they structurally over-trust their own generated answers, so a training regime built only on what *looks* correct reinforces that bias, while one built on engaging failure breaks the self-agreement loop (Why do models trust their own generated answers?).
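
The asymmetry can be sketched as a filter that is strict about successes and permissive about failures. This is a simplified illustration, not the GRPO-RoC implementation: the `Trajectory` fields, the `issue_count` quality proxy, and the `asymmetric_filter` function are assumptions made for this example, and uniform sampling stands in for whatever diversity-preserving selection the real method uses.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    text: str
    correct: bool
    issue_count: int  # hypothetical quality proxy, e.g. tool-call errors


def asymmetric_filter(
    trajectories: List[Trajectory], n_pos: int, n_neg: int, seed: int = 0
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Keep only the cleanest correct trajectories, but sample failures
    uniformly so that varied failure modes survive as negative signal."""
    rng = random.Random(seed)
    positives = sorted(
        (t for t in trajectories if t.correct), key=lambda t: t.issue_count
    )
    negatives = [t for t in trajectories if not t.correct]
    kept_pos = positives[:n_pos]  # strict: lowest issue counts only
    kept_neg = rng.sample(negatives, min(n_neg, len(negatives)))  # permissive
    return kept_pos, kept_neg


rollouts = [
    Trajectory("clean solve", True, 0),
    Trajectory("solve after 3 tool errors", True, 3),
    Trajectory("solve after 1 tool error", True, 1),
    Trajectory("wrong: sign error", False, 0),
    Trajectory("wrong: timeout", False, 2),
    Trajectory("wrong: misread problem", False, 1),
]
kept_pos, kept_neg = asymmetric_filter(rollouts, n_pos=2, n_neg=2)
```

Quality filtering is applied only on the positive side. Failures are not ranked at all, which is what keeps their variety intact.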

The thing you might not expect: optimizing hard for correct answers can actively *erase* good reasoning behavior. Post-training that faithfully drives answer correctness simultaneously suppresses unmeasured traits like epistemic hedging and uncertainty-aware reasoning; the single objective creates blind spots where the stylistic features that help generalization go unprotected (Can post-training objectives preserve reasoning style alongside correctness?). And because format shapes strategy far more than content does (multiple-choice training induces breadth-first search, free-form induces depth-first; see Does training data format shape reasoning strategy more than domain?), *how* you present correct answers quietly determines how the model thinks, which is a lever failure-analysis training engages with directly rather than by accident.


Sources (9 notes)

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal. This allowed a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models while producing cleaner reasoning.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Can post-training objectives preserve reasoning style alongside correctness?

Research shows that post-training objectives faithfully guide models toward correct answers yet simultaneously suppress unmeasured behaviors like epistemic verbalization. Single-objective optimization creates blind spots where stylistic features critical to generalization are unprotected.

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic training researcher. The question remains open: does training on correct answer form versus failure analysis produce fundamentally different internal models, or have recent methods dissolved this distinction?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, tracking post-training mechanics across instruction tuning, chain-of-thought, and RL:
• Instruction tuning on semantically empty or wrong instructions matches correct-instruction training — models absorb output *format distribution*, not task understanding (2023).
• Logically invalid chain-of-thought exemplars perform nearly as well as valid ones; reasoning form transfers without inference (2023–2025).
• Critique fine-tuning produces deeper understanding than correct-answer imitation, because critique forces engagement with the *gap* between attempt and correctness (2025).
• Verification of reasoning process raised task success from 32% to 87%, revealing correct-answer signals are blind to process violations (~2025).
• Asymmetric trajectory filtering — preserving diverse failures as negative signal — enabled 14B models to reach frontier math; removing failures removed the contrast models learn from (~2025).
• Post-training optimized for correctness actively suppresses epistemic hedging and uncertainty-aware reasoning; single objectives create blind spots (2024–2025).
• Training format (multiple-choice vs. free-form) shapes reasoning strategy more than domain content does (2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — instruction tuning teaches format, not understanding.
• arXiv:2501.17703 (2025) — critique fine-tuning outperforms imitation.
• arXiv:2508.20722 (2025) — agentic RL with asymmetric filtering.
• arXiv:2603.24472 (2026) — self-distillation and reasoning degradation.

Your task:
(1) RE-TEST the claim that correct-answer training is fundamentally format-centric. Has recent work on interpretability, mechanistic probing, or RL post-training (especially process-reward models, multi-step verifiers, or constitutional AI) since shown models *do* internalize reasoning structure from correct answers, or does the constraint persist? Separate the durable question (what is the *minimal sufficient signal* for reasoning internalization?) from the perishable claim (format alone is inert).
(2) Surface the strongest work from the last ~6 months that *contradicts* the synthesis's claim that failure analysis is mechanistically superior. Look for evidence that correct-answer training + large scale + new architectures dissolves the gap, or that critiques introduce their own blindnesses.
(3) Propose 2 research questions that assume the regime has moved: (A) If models now absorb reasoning structure from correct answers via scale or new objectives, what role does failure analysis *still* play? (B) What properties of failure (diversity, granularity, source—model vs. human) matter most, and can they be engineered into correct-answer datasets instead?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
