Do explanations actually help users spot AI mistakes?
Most AI explanations are designed to justify the system's answer, but do they help users distinguish correct from incorrect outputs? This research tests whether standard explanation formats genuinely improve error detection or just increase trust regardless of accuracy.
Users of LLMs must decide whether to trust an answer, often aided by reasoning traces, summaries of those traces, or post-hoc explanations. The implicit assumption is that more explanation helps users judge correctness. A between-subjects user study, simulating settings where users cannot independently verify the solution, tests this assumption and finds it largely false. Reasoning traces and post-hoc explanations are persuasive but not informative: relative to a no-explanation baseline, they increase user acceptance of the model's prediction regardless of whether that prediction is correct. In other words, they engender false trust.
The one condition that breaks the pattern is contrastive dual explanation, where the user is shown arguments both for and against the AI's answer. Dual explanation has the lowest rate of engendering false trust and is the only condition that genuinely improves users' ability to distinguish correct from incorrect outputs. The contrast with reasoning traces is instructive: traces produce high accuracy on correct answers but poor detection of incorrect ones (they raise confidence uniformly), whereas dual explanations produce a balanced effect — users stay accurate on both correct and incorrect cases.
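The difference between raising acceptance uniformly and improving discrimination can be made concrete with a small sketch. The Python below is an illustration only, not the study's analysis code: the `Trial` record, the `reliance_metrics` function, and the metric names are assumptions introduced here, and the toy counts are invented to show the arithmetic rather than taken from the study's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One participant decision (hypothetical record, not the study's schema)."""
    model_correct: bool  # was the model's answer actually correct?
    user_accepted: bool  # did the participant accept the model's answer?


def reliance_metrics(trials):
    """Split acceptance by ground truth to separate trust from false trust."""
    correct = [t for t in trials if t.model_correct]
    incorrect = [t for t in trials if not t.model_correct]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect trial")

    accept_when_correct = sum(t.user_accepted for t in correct) / len(correct)
    # Accepting a wrong answer is the "false trust" case.
    false_trust = sum(t.user_accepted for t in incorrect) / len(incorrect)
    return {
        "accept_when_correct": accept_when_correct,
        "false_trust": false_trust,
        # How much more often users accept right answers than wrong ones.
        "discrimination": accept_when_correct - false_trust,
    }


if __name__ == "__main__":
    # Invented toy counts, for illustration only.
    one_sided = (
        [Trial(True, True)] * 4
        + [Trial(False, True)] * 3
        + [Trial(False, False)] * 1
    )
    dual = (
        [Trial(True, True)] * 3
        + [Trial(True, False)] * 1
        + [Trial(False, True)] * 1
        + [Trial(False, False)] * 3
    )
    print(reliance_metrics(one_sided))
    # {'accept_when_correct': 1.0, 'false_trust': 0.75, 'discrimination': 0.25}
    print(reliance_metrics(dual))
    # {'accept_when_correct': 0.75, 'false_trust': 0.25, 'discrimination': 0.5}
```

Under this framing, a format that pushes both acceptance rates up together leaves discrimination flat, which is the pattern the note attributes to reasoning traces. A format that genuinely helps users would widen the gap between the two rates.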
Why it matters: the standard explanation formats deployed in production are optimized to be one-sided advocates for the answer, which is exactly what makes them persuasive without being diagnostic. Surfacing the case against the answer is what restores the user's discriminating capacity. The counterpoint, and the design lesson, is that "explainability" and "appropriate trust" can be at odds — adding a confident rationale can make a wrong answer more believable, so the intervention that helps is the one that deliberately argues against the system's own output.
Inquiring lines that use this note as a source (18)
This note is a source for these synthesized inquiries.
- Why do users interpret AI outputs through frameworks meant for human experts?
- Do people who choose to use AI fact-checkers actually become better at spotting misinformation?
- Why don't users push back when AI makes obvious mistakes about false claims?
- Can organized response format trick users into overestimating AI reliability?
- Can AI-generated explanations of errors teach as effectively as self-resolution?
- Why does explanation source matter more than explanation content?
- Should XAI designers treat explanations as arguments for adoption?
- Why does polished explanation make wrong AI systems more persuasive than poorly explained ones?
- How can correct explanations coexist with failed applications in AI?
- What architectural changes help AI avoid adding interpretations users didn't express?
- Why do familiar patterns that support correct answers sometimes drive errors?
- What explanation format actually helps users detect errors in AI systems?
- Why do humans trust explanations that fail counterfactual prediction tests?
- Should explanation quality be measured by user satisfaction or behavior prediction?
- What happens when users mistake AI assistance for their own competence?
- Can explainability and appropriate trust work against each other?
- How can humans evaluate explanations from systems they did not train?
- Why do students learn better from explanations than from solving problems from scratch?
Related concepts in this collection (5)
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  explains why traces persuade without informing: they look like reasoning but are not verified, and the user reads advocacy as evidence
- Do users worldwide trust confident AI outputs even when wrong?
  Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
  one-sided explanations act like confidence signals, dominating users' accuracy tracking
- Are AI explanations really descriptions or adoption arguments?
  Most XAI work treats explanations as neutral descriptions of model behavior, but they may actually be doing persuasive work to justify AI adoption. What happens when we acknowledge this rhetorical function?
  names the advocacy framing of explanations that dual explanation is designed to counterbalance
- Can LLM explanations actually help humans predict model behavior?
  Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
  grounds the persuasive-not-informative finding mechanistically: explanations gain plausibility without gaining precision, so they raise acceptance without improving diagnosis
- Can we distinguish helpful explanations from manipulative ones?
  Rhetorical strategies used to justify appropriate AI adoption rely on the same persuasion mechanisms as dark patterns. Without observable intent, explanation and manipulation look identical, raising urgent questions about how to audit XAI systems responsibly.
  extends the harm: one-sided rationales that engender false trust are the benign end of the same machinery that becomes a dark pattern
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Evaluating the False Trust Engendered by LLM Explanations
- Can AI Explanations Make You Change Your Mind?
- Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Expanding Explainability: Towards Social Transparency in AI systems
- Language Models Learn to Mislead Humans via RLHF
- Rhetorical XAI: Explaining AI’s Benefits as well as its Use via Rhetorical Design
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Original note title
only contrastive dual explanations arguing both sides genuinely improve users' ability to detect AI errors