SYNTHESIS NOTE

Can LLM explanations actually help humans predict model behavior?

Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Do Models Explain Themselves?" introduces a rigorous evaluation framework for model explanations: can the explanation help a human predict what the model would do on related but different inputs? If a model answers "yes" to "Can eagles fly?" with the explanation "all birds can fly," then a human would infer it also answers "yes" to "Can penguins fly?" If the model actually says "no," the explanation was imprecise — it gave the human a wrong mental model.

Two metrics operationalize this:

Simulation precision: the fraction of counterfactuals where human inference (from the explanation) matches the model's actual output
Simulation generality: the diversity of counterfactuals relevant to the explanation
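
The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation. The `Counterfactual` record, the eagle-style example answers, the restriction of precision to counterfactuals the explanation actually covers, and the use of one minus mean pairwise Jaccard word overlap as a stand-in for "diversity" are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional


@dataclass
class Counterfactual:
    """Hypothetical record for one related input (not from the paper)."""
    question: str
    simulated: Optional[str]  # answer a human infers from the explanation; None if it says nothing
    actual: str               # answer the model actually gives


def simulation_precision(items: List[Counterfactual]) -> Optional[float]:
    """Fraction of covered counterfactuals where the inferred answer matches the model."""
    covered = [c for c in items if c.simulated is not None]
    if not covered:
        return None
    return sum(c.simulated == c.actual for c in covered) / len(covered)


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; an assumed, crude similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def simulation_generality(items: List[Counterfactual]) -> Optional[float]:
    """Assumed diversity proxy: one minus mean pairwise similarity of covered questions."""
    questions = [c.question for c in items if c.simulated is not None]
    pairs = list(combinations(questions, 2))
    if not pairs:
        return None
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Explanation under test: "all birds can fly". Model answers below are invented for illustration.
examples = [
    Counterfactual("Can penguins fly?", simulated="yes", actual="no"),
    Counterfactual("Can sparrows fly?", simulated="yes", actual="yes"),
    Counterfactual("Can ostriches fly?", simulated="yes", actual="no"),
    Counterfactual("Can dolphins swim?", simulated=None, actual="yes"),  # explanation is silent here
]

precision = simulation_precision(examples)
generality = simulation_generality(examples)
print(f"precision={precision:.2f} generality={generality:.2f}")
```

In this toy case the explanation sounds reasonable but supports a correct prediction on only one of the three inputs it covers, which is the gap between plausibility and precision that the note goes on to discuss.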

The key finding: precision does not correlate with plausibility. Explanations that humans judge as factually correct and logically coherent do not reliably enable accurate prediction of model behavior. This suggests that RLHF, which optimizes for human approval of explanations, will improve plausibility (explanations that look good) without improving precision (explanations that predict behavior). The model learns to generate explanations humans like, not explanations humans can use.

The second finding reinforces this: GPT-4 approximates human simulators with comparable inter-annotator agreement, and its agreement with humans is sometimes higher than human-human agreement. This validates GPT-4 as a precision evaluator but also underscores that the precision problem is not a measurement issue — it is genuine.

The implication for the CoT-as-explanation paradigm is severe. The entire interpretability case for chain-of-thought rests on the assumption that reading the trace helps users understand how the model works. But if explanation precision is low, users build incorrect mental models from CoT. Read alongside the related note "Do chain-of-thought traces actually help users understand model reasoning?", this implies that optimizing for better-looking traces (via RLHF) will make the mental model problem worse, not better: users will be more confident in less accurate predictions.

The satisfaction-vs-faithfulness mechanism makes the RLHF prediction concrete. This note argues that RLHF improves plausibility without improving precision; a user study ("Evaluating the False Trust Engendered by LLM Explanations") names the causal pathway. It draws on the finding that satisfaction (leaving the user feeling they understand the AI's reasoning) is a key property of explanations in human-AI interaction, and that RLHF-optimized models excel at producing helpful, warm, satisfying responses. Post-hoc explanations arguing for an answer's correctness therefore engender high false trust and hamper users' ability to distinguish correct from incorrect outputs, plausibly because the same RLHF optimization that drives sycophancy also drives explanations that please rather than predict. So "RLHF improves plausibility, not precision" is not just an absence of correlation between two metrics but a behavioral consequence: optimizing explanations for user satisfaction is what produces persuasive-but-uninformative explanations, the precise failure this note attributes to low counterfactual simulatability.

Source (enrichment): Flaws — "Evaluating the False Trust Engendered by LLM Explanations", https://arxiv.org/abs/2605.10930

Inquiring lines that read this note (3)

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Each broader line of inquiry is listed first, followed by the specific question that reads this note.

Do language models develop causal world models or rely on statistical patterns?
Do LLMs need world models to make accurate predictions?

How do we evaluate AI systems when user perception misleads actual performance?
Should explanation quality be measured by user satisfaction or behavior prediction?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?
Why don't LLM explanations predict what models would actually do?

Related concepts in this collection (5)

Does chain of thought reasoning actually explain model decisions?
When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
weak correlation between CoT quality and output quality is the production-system version of low counterfactual simulatability

Do chain-of-thought traces actually help users understand model reasoning?
Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.
decoupled objectives: precision ≠ plausibility is the metric-level evidence for this architectural claim

Do users worldwide trust confident AI outputs even when wrong?
Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
users trust plausible explanations the same way they trust confident outputs; both fail prediction

Can we detect memorable moments by observing emotional expressions?
Emotion recognition systems assume that detecting emotional moments will identify what people remember. But does observed emotion in group settings actually predict individual memorability, or does the proxy fail?
analogous proxy failure: plausible-looking explanations don't predict actual understanding, just as emotional-looking moments don't predict actual memorability; both demonstrate that observable surface features diverge from the functional process they are assumed to index

Do explanations actually help users spot AI mistakes?
Most AI explanations are designed to justify the system's answer, but do they help users distinguish correct from incorrect outputs? This research tests whether standard explanation formats genuinely improve error detection or just increase trust regardless of accuracy.
grounds: explanations gain plausibility without precision so acceptance rises without diagnosis
