Can LLM explanations actually help humans predict model behavior?
Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
"Do Models Explain Themselves?" introduces a rigorous evaluation framework for model explanations: can the explanation help a human predict what the model would do on related but different inputs? If a model answers "yes" to "Can eagles fly?" with the explanation "all birds can fly," then a human would infer it also answers "yes" to "Can penguins fly?" If the model actually says "no," the explanation was imprecise — it gave the human a wrong mental model.
Two metrics operationalize this:
- Simulation precision: the fraction of counterfactuals where human inference (from the explanation) matches the model's actual output
- Simulation generality: the diversity of counterfactuals relevant to the explanation
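The two metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the record format, the choice to skip counterfactuals where the human cannot infer any answer, and the use of token-overlap (Jaccard) similarity as the diversity measure are all hypothetical choices made here for clarity.

```python
from itertools import combinations


def simulation_precision(records):
    """Fraction of simulatable counterfactuals where the human guess matches the model.

    records: list of (human_guess, model_output) pairs. A human_guess of None
    means the explanation gave no basis for a guess (assumption: such
    counterfactuals are excluded from the denominator).
    """
    simulatable = [(g, m) for g, m in records if g is not None]
    if not simulatable:
        return None  # nothing to score
    return sum(g == m for g, m in simulatable) / len(simulatable)


def jaccard_similarity(a, b):
    """Token-overlap similarity between two questions (an assumed, crude measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def simulation_generality(counterfactuals):
    """One minus the average pairwise similarity of the counterfactual questions."""
    pairs = list(combinations(counterfactuals, 2))
    if not pairs:
        return 0.0  # a single counterfactual has no diversity to measure
    avg_sim = sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - avg_sim


# Eagle/penguin example: the explanation "all birds can fly" leads the human
# to guess "yes" for penguins, but the model answers "no".
records = [("yes", "no"), ("yes", "yes"), (None, "no")]
print(simulation_precision(records))
print(simulation_generality(["can penguins fly", "can ostriches fly"]))
```

Excluding non-simulatable counterfactuals keeps precision focused on cases where the explanation actually licensed a prediction; an embedding-based similarity could replace the token overlap without changing the structure.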
The key finding: precision does not correlate with plausibility. Explanations that humans judge as factually correct and logically coherent are no more likely to enable accurate prediction of model behavior than explanations judged less plausible. This implies that RLHF, which optimizes for human approval of explanations, can improve plausibility (explanations that look good) without improving precision (explanations that predict behavior). The model learns to generate explanations humans like, not explanations humans can use.
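One way to check a claim like this on your own data is to compute a correlation coefficient between per-explanation plausibility ratings and per-explanation precision scores. The sketch below uses a hand-written Pearson correlation; the function name and the toy numbers are illustrative assumptions only and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up values purely to exercise the function (NOT real measurements):
# plausibility ratings per explanation and the precision each one achieved.
plausibility = [1, 2, 3, 4]
precision = [1, -1, -1, 1]  # centered scores chosen so the correlation is zero
print(pearson_r(plausibility, precision))
```

A value near zero on real ratings would be consistent with the decoupling described above; a rank correlation could be substituted if the ratings are ordinal.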
The second finding reinforces this: GPT-4 approximates human simulators with comparable inter-annotator agreement, and its agreement with humans is sometimes higher than human-human agreement. This validates GPT-4 as a precision evaluator but also underscores that the precision problem is not a measurement issue — it is genuine.
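Comparing a model simulator against human simulators comes down to an agreement statistic over their guesses on the same counterfactuals. The note does not say which statistic was used, so the sketch below assumes Cohen's kappa as one common choice; the function name and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)


# Made-up guesses from a human simulator and a model simulator.
human_guesses = ["yes", "yes", "no", "no"]
model_guesses = ["yes", "no", "no", "no"]
print(cohens_kappa(human_guesses, model_guesses))
```

Running the same function on human-human pairs and on human-model pairs gives the two numbers that the comparison above refers to.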
The implication for the CoT-as-explanation paradigm is severe. The entire interpretability case for chain-of-thought rests on the assumption that reading the trace helps users understand how the model works. But if explanation precision is low, users build incorrect mental models from CoT. Given the doubts raised in the related note "Do chain-of-thought traces actually help users understand model reasoning?", optimizing for better-looking traces (via RLHF) is likely to make the mental-model problem worse, not better: users will be more confident in less accurate predictions.
The satisfaction-vs-faithfulness mechanism makes the RLHF prediction concrete. This note argues that RLHF improves plausibility without improving precision; a user study ("Evaluating the False Trust Engendered by LLM Explanations") names the causal pathway. It draws on the finding that satisfaction (leaving the user feeling they understand the AI's reasoning) is a key property of explanations in human-AI interaction, and that RLHF-optimized models excel at producing helpful, warm, satisfying responses. Post-hoc explanations arguing for an answer's correctness therefore engender high false trust and hamper users' ability to distinguish correct from incorrect outputs, plausibly because the same RLHF optimization that drives sycophancy also drives explanations that please rather than predict. So "RLHF improves plausibility, not precision" is not just a lack of correlation between two metrics but a behavioral consequence: optimizing explanations for user satisfaction is what produces persuasive-but-uninformative explanations, the precise failure this note attributes to low counterfactual simulatability.
Source (enrichment): Flaws — "Evaluating the False Trust Engendered by LLM Explanations", https://arxiv.org/abs/2605.10930
Related concepts in this collection 5
- Does chain of thought reasoning actually explain model decisions?
When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
weak correlation between CoT quality and output quality is the production-system version of low counterfactual simulatability
- Do chain-of-thought traces actually help users understand model reasoning?
Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
decoupled objectives: precision ≠ plausibility is the metric-level evidence for this architectural claim
- Do users worldwide trust confident AI outputs even when wrong?
Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
users trust plausible explanations the same way they trust confident outputs; both fail prediction
- Can we detect memorable moments by observing emotional expressions?
Emotion recognition systems assume that detecting emotional moments will identify what people remember. But does observed emotion in group settings actually predict individual memorability, or does the proxy fail?
analogous proxy failure: plausible-looking explanations don't predict actual understanding, just as emotional-looking moments don't predict actual memorability; both demonstrate that observable surface features diverge from the functional process they are assumed to index
- Do explanations actually help users spot AI mistakes?
Most AI explanations are designed to justify the system's answer, but do they help users distinguish correct from incorrect outputs? This research tests whether standard explanation formats genuinely improve error detection or just increase trust regardless of accuracy.
grounds: explanations gain plausibility without precision so acceptance rises without diagnosis
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Rethinking Large Language Models in Mental Health Applications
- Evaluating the False Trust Engendered by LLM Explanations
- Large Language Model Reasoning Failures
- Measuring Faithfulness in Chain-of-Thought Reasoning
- Eliciting Reasoning in Language Models with Cognitive Tools
- Word Meanings in Transformer Language Models
Original note title
counterfactual simulatability of llm explanations is low and uncorrelated with plausibility — rlhf cannot fix explanation precision