Can AI-generated explanations of errors teach as effectively as self-resolution?
This explores whether being handed an AI's explanation of what went wrong teaches as well as working through the error and resolving it yourself — a question the corpus answers for both human learners and the models themselves.
This reads the question as: does receiving an explanation of an error teach as effectively as doing the error work yourself? Across the collection, the recurring answer is that the learning lives in the *struggle with failure*, not in the explanation handed over afterward. The sharpest human-side evidence is that learners who encountered errors and resolved them independently retained more skill, while those who delegated debugging to AI bypassed the cognitive work that produces learning. Even the learners who debugged most with AI scored lowest on later skill tests (Does AI assistance remove a core learning channel through error work?). A clean explanation isn't neutral: it removes the very channel through which the skill forms.
There's a deeper problem with explanations as a teaching tool: they tend to *win trust whether or not they're correct*. Reasoning traces and post-hoc explanations increase acceptance of an answer regardless of accuracy, manufacturing false confidence. Only contrastive 'dual' explanations, which argue both for and against the answer, actually help people tell right from wrong (Do explanations actually help users spot AI mistakes?). So a one-sided AI explanation of an error may teach the learner to trust the AI rather than to understand the error. This compounds with the well-documented human tendency to over-rely on confident outputs (Why do people trust AI outputs they shouldn't?; How well do language models understand their own knowledge?).
The model-training literature points the same direction, which is the surprising part. Training a model to *critique* flawed responses produces deeper understanding than training it to imitate correct answers, because critique forces engagement with failure modes rather than surface patterns (Does critiquing errors teach deeper understanding than imitating correct answers?). Self-correction can't be taught by fine-tuning on pre-made correction traces, because the errors in the training data don't match the errors the model makes at test time; it works only when the model practices on its *own* mistakes via online RL (Why does self-correction training on offline data fail?). And models trained on the full messy search process, with wrong turns, backtracking, and dead ends serialized into the data, reach about 25% higher accuracy than models trained only on clean optimal solutions (Does training on messy search processes improve reasoning?). In each case, exposure to the error process beats exposure to the polished resolution.
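To make the contrast between a "clean" and a "messy" training example concrete, here is a minimal Python sketch. It is an illustration only, not the data format used in the search-process work cited above: the toy task (reach a target number using `+3` or `*2`), the function names `search`, `clean_example`, and `messy_example`, and the trace vocabulary are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy task: reach `target` from `start` using these two moves.
OPS: List[Tuple[str, Callable[[int], int]]] = [
    ("+3", lambda x: x + 3),
    ("*2", lambda x: x * 2),
]


def search(start: int, target: int, max_depth: int) -> Tuple[Optional[List[str]], List[str]]:
    """Depth-first search returning (solution_path, full_trace).

    The trace records every visit, dead end, and backtrack,
    not just the moves that end up on the solution path.
    """
    trace: List[str] = []

    def dfs(value: int, path: List[str], depth: int) -> Optional[List[str]]:
        trace.append(f"visit {value}")
        if value == target:
            trace.append("goal")
            return path
        if depth == max_depth or value > target:
            trace.append(f"dead-end {value}")
            return None
        for name, fn in OPS:
            trace.append(f"try {name}")
            found = dfs(fn(value), path + [name], depth + 1)
            if found is not None:
                return found
            trace.append(f"backtrack to {value}")
        return None

    return dfs(start, [], 0), trace


def clean_example(start: int, target: int, path: List[str]) -> str:
    """Serialize only the successful moves (the polished resolution)."""
    return f"{start} -> {target}: " + " ".join(path)


def messy_example(start: int, target: int, trace: List[str]) -> str:
    """Serialize the whole search, wrong turns and backtracking included."""
    return f"{start} -> {target}: " + " | ".join(trace)


if __name__ == "__main__":
    path, trace = search(start=1, target=8, max_depth=3)
    if path is not None:
        print(clean_example(1, 8, path))
    print(messy_example(1, 8, trace))
```

The clean string contains only the two winning moves, while the messy string also contains the abandoned branches through 7, 10, and 14. Training on the second kind of string is what exposes a learner to the error process rather than only its outcome.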
The twist worth carrying away: it may not be the *correctness* of the explanation that teaches at all. Models trained on deliberately corrupted, semantically irrelevant reasoning traces perform comparably to those trained on correct ones, suggesting traces act as computational scaffolding for doing the work, not as meaningful content to absorb (Do reasoning traces need to be semantically correct?). That reframes the whole question. If what teaches is the act of generating and grappling with reasoning rather than the explanation's truth, then a handed-over explanation, however accurate, skips the part that does the teaching. Self-resolution isn't just one option among equals; it's the channel where the learning actually happens.
Sources (8 notes)
Research shows learners without AI encountered more errors and resolved them independently, resulting in higher skill retention. AI-assisted learners delegated debugging to AI, bypassing the cognitive work that produces learning—even those who debugged most with AI scored lowest on skill assessments.
Reasoning traces and post-hoc explanations increase user acceptance of AI answers regardless of correctness, engendering false trust. Only dual explanations presenting arguments for and against the answer genuinely help users distinguish correct from incorrect outputs.
Rose-Frame identifies map-territory confusion, intuition-reason conflation, and confirmation-bias reinforcement as traps that multiply their distorting effects when they co-occur. Evidence from cross-linguistic overreliance and architectural transformer biases confirms the compounding mechanism operates universally.
LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.
Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.