INQUIRING LINE

Neither 'did users enjoy the explanation?' nor 'can they predict the AI?' reliably tells you if the explanation actually worked.

Should explanation quality be measured by user satisfaction or behavior prediction?

This explores a forced choice — do we judge an AI explanation good because the person liked it, or because it lets them predict what the model will do next — and the corpus argues both are flawed proxies for the thing that actually matters.

This reads the question as a contest between two yardsticks: user satisfaction versus behavior prediction. The corpus's sharpest move is to show that both can fail, and fail in opposite directions, so picking one over the other is the wrong frame. Satisfaction is the easier metric to game. Work on STORM finds that people report being satisfied while remaining internally confused, especially when they don't know what they don't know, and that durable understanding tracks sustained engagement, not the satisfaction rating collected right afterward (Does user satisfaction actually measure cognitive understanding?). Worse, explanations that feel good can actively mislead: reasoning traces and post-hoc justifications make users more likely to accept AI answers whether or not those answers are correct, manufacturing false trust (Do explanations actually help users spot AI mistakes?).

But behavior prediction has its own trap. The counterfactual-simulatability research is the surprise here: explanations humans rate as correct and coherent routinely fail to predict how the model behaves on slightly altered inputs, and plausibility turns out to be uncorrelated with predictive accuracy (Can LLM explanations actually help humans predict model behavior?). Crucially, RLHF makes explanations more convincing without making them more predictive, so optimizing for what people find convincing widens the gap and leaves users confident and wrong. The two metrics aren't just different lenses on one quality: improving the first does nothing for the second, and the distance between them grows.
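One way to make the gap between plausibility and prediction concrete is to score each explanation twice: once by how plausible readers rate it, and once by how often a reader relying on it correctly anticipates the model's answer on altered inputs. The sketch below is a minimal illustration and not the cited study's protocol. The `Explanation` and `Counterfactual` records, the numeric plausibility scale, and the string-valued answers are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional, Sequence


@dataclass
class Counterfactual:
    # What a reader infers the model will answer on the altered input,
    # using only the explanation; None if the explanation gives no basis.
    guessed: Optional[str]
    # What the model actually answered on the altered input.
    actual: str


@dataclass
class Explanation:
    plausibility: float  # hypothetical human rating, e.g. on a 1-5 scale
    counterfactuals: list[Counterfactual]


def simulation_precision(e: Explanation) -> Optional[float]:
    """Share of guessable counterfactuals where the guess matched the model."""
    judged = [c for c in e.counterfactuals if c.guessed is not None]
    if not judged:
        return None  # the explanation supported no predictions at all
    return sum(c.guessed == c.actual for c in judged) / len(judged)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def plausibility_vs_precision(explanations: Sequence[Explanation]) -> float:
    """Correlate plausibility ratings with simulation precision."""
    pairs = [(e.plausibility, simulation_precision(e)) for e in explanations]
    scored = [(p, s) for p, s in pairs if s is not None]
    return pearson([p for p, _ in scored], [s for _, s in scored])
```

Under these assumptions, a correlation near zero over a real set of rated explanations would correspond to the finding that plausibility does not track predictive accuracy. Explanations that support no guesses are dropped from the correlation rather than scored as zero, which is a design choice that a real evaluation might make differently.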

The more interesting answer the corpus offers is to stop measuring the explanation in isolation. One line of work reframes explainability as a communication problem: quality lives in the triad of who presents the explanation, how it's framed, and what the recipient is trying to do, rather than being an intrinsic property you can score once (What if XAI is fundamentally a communication problem?). A study of 399 everyday explanations reinforces this, showing that understanding is co-constructed through dialogue moves rather than delivered as a monologue, which is exactly what current one-shot LLM explanations get wrong (What makes explanations work in real conversation?).

If you want a usable target instead of a binary, the corpus keeps pointing at the same one: does the explanation help the user do the right thing? That's why dual, contrastive explanations, which argue both for and against the answer, are the only kind shown to actually improve a person's ability to catch AI mistakes (Do explanations actually help users spot AI mistakes?). And the RecExplainer line suggests you don't have to choose at all: it trains an LLM surrogate with separate behavior alignment (matching outputs) and intention alignment (reading internal states), then hybridizes them so the explanation is both faithful to the model and intelligible to the person (Can LLMs explain recommenders by mimicking their internal states?).

The thing you might not have expected to learn: satisfaction and prediction aren't endpoints of one scale. They're closer to intelligibility and faithfulness, two requirements that can trade off against each other. The explanations worth building hold both at once, and the cleanest way to detect a failure is behavioral but adversarial (can the user spot the model's error?) rather than either a smile or a raw prediction score.
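The adversarial behavioral check can be sketched as a comparison of how often users accept the AI's answer when it is right versus when it is wrong, computed separately for each explanation style. This is a minimal sketch under stated assumptions: the `Trial` record, the condition labels, and the "gap" score are hypothetical names introduced here, not a metric defined by the cited work.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    condition: str       # hypothetical label, e.g. "one_sided" or "dual"
    ai_correct: bool     # was the AI's answer actually correct?
    user_accepted: bool  # did the user go along with the AI's answer?


def acceptance_rate(trials: Sequence[Trial], ai_correct: bool) -> float:
    """Share of trials with the given AI correctness that the user accepted."""
    subset = [t for t in trials if t.ai_correct == ai_correct]
    if not subset:
        raise ValueError("no trials with ai_correct=%s" % ai_correct)
    return sum(t.user_accepted for t in subset) / len(subset)


def error_detection_gap(trials: Sequence[Trial], condition: str) -> float:
    """Acceptance on correct answers minus acceptance on incorrect ones.

    A gap of 0 means users accept right and wrong answers equally often,
    i.e. the explanation gives them no help in spotting errors.
    A gap of 1 means they accept every correct answer and reject every wrong one.
    """
    in_condition = [t for t in trials if t.condition == condition]
    return (acceptance_rate(in_condition, True)
            - acceptance_rate(in_condition, False))
```

An explanation style that raises acceptance across the board would look good on a satisfaction survey yet score a gap near zero here, which is the false-trust pattern described above. The study design needs both correct and incorrect AI answers in every condition, otherwise the gap is undefined and the function raises.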

Sources 6 notes

Does user satisfaction actually measure cognitive understanding?

STORM shows users express satisfaction despite internal confusion, especially when unaware of knowledge gaps. Sustained engagement correlates with actual self-understanding, not immediate satisfaction ratings.

Do explanations actually help users spot AI mistakes?

Reasoning traces and post-hoc explanations increase user acceptance of AI answers regardless of correctness, engendering false trust. Only dual explanations presenting arguments for and against the answer genuinely help users distinguish correct from incorrect outputs.

Can LLM explanations actually help humans predict model behavior?

Explanations that humans judge as correct and coherent fail to predict model behavior on counterfactuals. RLHF optimization improves how convincing explanations seem without improving their actual predictive accuracy, leaving users confident but wrong.

What if XAI is fundamentally a communication problem?

Explanation quality is not intrinsic to the explanation itself but depends on the rhetorical situation: who presents it, how it is framed, and what role the recipient plays. Evaluations that ignore this triad measure only a narrow slice of real-world effectiveness.

What makes explanations work in real conversation?

Analysis of 399 daily-life explanations shows that topic relation, dialogue act, and explanation move jointly predict understanding success. Explanations are co-constructed through interaction patterns, not monological delivery—challenging how LLMs currently generate explanations.

Can LLMs explain recommenders by mimicking their internal states?

RecExplainer trains LLMs via three alignment methods: behavior (mimicking outputs), intention (incorporating neural embeddings), and hybrid (combining both). The hybrid approach produces explanations that are simultaneously faithful to the target model and intelligible to users by balancing internal-state inspection with human-readable reasoning.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an XAI researcher re-evaluating the satisfaction–vs–prediction framing for explanation quality. The question: should we measure explanation quality by user satisfaction or behavior prediction?

What a curated library found — and when (dated claims, not current truth):
These findings span July 2023 to May 2026; treat them as perishable constraints.

• User satisfaction systematically diverges from internal cognitive clarity and sustained engagement; people report satisfaction while remaining confused about what they don't know (~2023).
• Reasoning traces and post-hoc explanations increase false trust: users accept AI answers more readily regardless of correctness, especially when explanations feel plausible (~2023–2024).
• Counterfactual simulatability of LLM explanations is low and uncorrelated with perceived accuracy; explanations that seem correct fail to predict model behavior on altered inputs (~2023).
• RLHF optimizes explanations for convincingness without improving predictive fidelity, widening the satisfaction–accuracy gap (~2024).
• Only contrastive dual explanations (arguing both for and against) improve users' ability to catch AI errors; one-shot monologue explanations fail (~2024).
• Explanation quality is situational, not intrinsic: it depends on presenter, framing, and recipient intent; dialogue improves understanding more than static delivery (~2024).

Anchor papers (verify; mind their dates):
• 2307.08678 (counterfactual simulatability, Jul 2023)
• 2311.10947 (RecExplainer hybrid alignment, Nov 2023)
• 2403.00662 (dialogical explanation modeling, Mar 2024)
• 2605.10930 (false trust from LLM explanations, May 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, check whether post-2026 LLM capabilities, training methods (e.g., process reward models, chain-of-thought verification), tooling (e.g., structured explanation formats, multi-turn harnesses), or evaluation paradigms have relaxed or overturned the satisfaction–prediction divorce. Separate the durable tension (likely still present) from any resolved bottleneck; cite what resolved it. Where does the gap still hold?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper shown satisfaction and prediction can align under certain design constraints or model scales? Flag disagreement.
(3) Propose 2 research questions that ASSUME the regime has shifted: e.g., if multi-turn dialogue or process-level transparency has narrowed the gap, how do we measure robustness of that alignment across distribution shifts?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
