Can models be trained to explain answers instead of imitating them?
This explores whether a model can be trained to produce genuine explanations of *why* an answer is right, rather than learning to output the right answer and decorating it with reasoning-shaped text afterward. The corpus is unusually pointed here, and the short version is: the gap between explaining and imitating is real, measurable, and not what standard training optimizes for. The most direct evidence is the finding in Does supervised fine-tuning improve reasoning or just answers?: supervised fine-tuning raises final-answer accuracy on benchmarks while *cutting* the information gained at each reasoning step by 38.9%. The model gets better at landing on the correct answer through post-hoc rationalization, and standard metrics never notice because they score only the final answer. That's imitation wearing the costume of explanation.
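To make "information gained at each reasoning step" concrete, here is a minimal sketch. It assumes a simple definition that may differ from the one used in the cited note: the gain of a step is the change, in bits, in the log-probability the model assigns to the correct answer once that step is appended. The function names and the probability values are hypothetical, chosen only for illustration.

```python
import math

def step_information_gain(p_correct):
    """Per-step gain in bits: log2(p_t) - log2(p_{t-1}).

    p_correct[t] is the (assumed, externally measured) probability the model
    assigns to the correct answer after reading the first t reasoning steps.
    """
    return [math.log2(after) - math.log2(before)
            for before, after in zip(p_correct, p_correct[1:])]

def mean_information_gain(p_correct):
    """Average gain per reasoning step."""
    gains = step_information_gain(p_correct)
    return sum(gains) / len(gains)

# Hypothetical traces. Both end with the same confidence in the correct answer,
# so a final-answer metric cannot tell them apart.
genuine = [0.10, 0.25, 0.55, 0.90]       # each step moves the model toward the answer
rationalized = [0.85, 0.86, 0.88, 0.90]  # answer was already settled before the "reasoning"

print(mean_information_gain(genuine))
print(mean_information_gain(rationalized))
```

Under this definition the two traces are indistinguishable by final correctness, yet the second one's steps carry almost no information: the answer was fixed before the reasoning was written.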
Two other notes deepen the unease. Models trained on deliberately corrupted, irrelevant reasoning traces perform about as well as those trained on correct ones (Do reasoning traces need to be semantically correct?), which suggests the trace often functions as computational scaffolding rather than a faithful account of how the model reasoned. And mechanistically, transformers trained with hidden chain-of-thought tokens have been caught computing the answer in their earliest layers and then *suppressing* that work in the final layers to emit format-compliant filler (Do transformers hide reasoning before producing filler tokens?). The explanation you read can be a performance layered over a hidden computation. So 'explain instead of imitate' isn't just a training-objective choice; it runs against the grain of what the architecture and the loss function reward.
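The layer-by-layer finding rests on logit-lens analysis, which can be sketched in a few lines. The toy below is an assumption-laden illustration, not the cited experiment: the two-dimensional hidden vectors, the three-token vocabulary, and the `logit_lens` helper are all invented, and the final layer norm that a real logit lens usually applies is omitted for simplicity.

```python
def logit_lens(hidden_by_layer, unembed):
    """Project each layer's hidden vector through the unembedding matrix.

    Returns, for every layer, the vocabulary sorted from most to least likely.
    """
    rankings = []
    for hidden in hidden_by_layer:
        logits = {
            token: sum(w * h for w, h in zip(row, hidden))
            for token, row in unembed.items()
        }
        rankings.append(sorted(logits, key=logits.get, reverse=True))
    return rankings

# Hypothetical unembedding rows: "42" is the true answer, "..." is a filler token.
unembed = {"42": [1.0, 0.0], "...": [0.0, 1.0], "7": [-1.0, 0.0]}

# Hypothetical hidden states: early layers point at the answer,
# the last layer is pushed toward the filler direction.
hidden_by_layer = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.3], [0.4, 1.5]]

rankings = logit_lens(hidden_by_layer, unembed)
for layer, ranked in enumerate(rankings, start=1):
    print(layer, ranked)
```

In this toy, the answer token tops the ranking in the early layers and drops to second place in the last layer, where the filler token wins. That mirrors the claim that the suppressed reasoning remains recoverable from lower-ranked predictions.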
The most hopeful counter-model comes from recommendation systems, where the explain-vs-imitate split is made explicit. Can LLMs explain recommenders by mimicking their internal states? describes RecExplainer, which trains an LLM three ways: *behavior* alignment (mimic the target model's outputs), *intention* alignment (incorporate the target model's internal embeddings), and a hybrid of the two. The pure-imitation version reproduces answers; the hybrid produces explanations that are both faithful to the target model and readable to a human. That's almost a blueprint for your question: explanation quality improves precisely when training reaches past the output and grounds itself in the system's internal state, rather than copying its answers.
There's a hard ceiling worth naming, though. Explanation can only surface reasoning the model can actually do. Can prompt optimization teach models knowledge they lack? shows that prompting can reorganize and activate existing knowledge but cannot supply knowledge absent from training, and Why do language models ignore information in their context? shows that when training priors are strong enough, parametric knowledge overrides what is in the context. Textual prompting alone cannot talk the model out of that; changing it requires causal intervention in the representations. So a model coaxed to 'explain' may just generate a more fluent rationalization of a baked-in prior.
Where the corpus turns genuinely encouraging is on the adjacent skill of knowing when *not* to bluff. Models can be trained to recognize ill-posed or under-specified problems and disengage rather than confabulate. RL pushed proactive critical-thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems (Can models learn to ask clarifying questions instead of guessing?), and related work decomposes a 'good question' into trainable attributes (Can models learn to ask genuinely useful clarifying questions?) or lets the behavior emerge from training only on complete problems (Can models learn to ask clarifying questions without explicit training?). The thread connecting all of these to your question: explanation becomes honest only when the training signal rewards the *process* (information gained, intent discovered, premises checked) instead of the answer. Reward the answer and you get a confident imitator; reward the reasoning and you get something closer to an explainer. The catch the corpus keeps circling is that 'reward the reasoning' is far harder to measure than 'reward the answer,' which is exactly why the imitation trap is the default.
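The contrast between rewarding the answer and rewarding the process can be sketched as two reward functions. This is an illustrative toy under stated assumptions, not the reward used in any of the cited notes: the `Episode` fields and both function names are hypothetical, and "process" is reduced to a single premise check.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    problem_is_flawed: bool         # the problem is ill-posed or under-specified
    model_flagged_flaw: bool        # the model declined to answer and named the flaw
    answer_matches_reference: bool  # the emitted answer equals the benchmark's reference

def outcome_reward(ep: Episode) -> float:
    """Answer-only reward: pays for matching the reference, whatever the process."""
    return 1.0 if ep.answer_matches_reference else 0.0

def process_aware_reward(ep: Episode) -> float:
    """Pays for checking the premise first, and for the answer only if the problem is sound."""
    if ep.problem_is_flawed:
        return 1.0 if ep.model_flagged_flaw else 0.0
    if ep.model_flagged_flaw:
        return 0.0  # false alarm on a well-posed problem
    return 1.0 if ep.answer_matches_reference else 0.0

# A confident bluff on a flawed problem versus an honest refusal.
bluff = Episode(problem_is_flawed=True, model_flagged_flaw=False, answer_matches_reference=True)
honest = Episode(problem_is_flawed=True, model_flagged_flaw=True, answer_matches_reference=False)

print(outcome_reward(bluff), process_aware_reward(bluff))
print(outcome_reward(honest), process_aware_reward(honest))
```

The answer-only reward pays the bluff and penalizes the honest refusal, while the process-aware reward does the reverse. The sketch also shows why the second kind is harder to build: it needs labels such as `problem_is_flawed` that a final-answer benchmark does not provide.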
Sources (9 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
RecExplainer trains LLMs via three alignment methods: behavior (mimicking outputs), intention (incorporating neural embeddings), and hybrid (combining both). The hybrid approach produces explanations that are simultaneously faithful to the target model and intelligible to users by balancing internal-state inspection with human-readable reasoning.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.