INQUIRING LINE

Does highlighting input features reduce human over-reliance on machine outputs?

This reads the question as asking whether explainability techniques that surface which inputs drove a model's output (saliency maps, feature highlighting) actually make people trust machine outputs less when they should. The corpus has no paper that tests that intervention head-on, but it has a sharp account of why over-reliance happens, and that account bears directly on whether highlighting would work.


This line of inquiry explores whether showing people which input features shaped a machine's answer reduces over-reliance. The honest starting point: nothing in this collection runs that experiment directly, and there is no saliency-map or feature-attribution study here. But the corpus is unusually clear about the *mechanism* of over-reliance, and that mechanism predicts highlighting may be aiming at the wrong target.

The core finding is that trust in AI output is driven by surface form, not by reasoning about inputs. Polished, professional-looking output gets believed because people carry a lifelong heuristic that good-looking work signals expert thinking. Generative systems produce the polish without the judgment, which is most dangerous for people who lack the domain knowledge to check substance against form (see "Does polished AI output trick audiences into trusting it?"). If over-trust is a response to fluency rather than to evidence, then annotating *which features mattered* may not touch the lever that's actually being pulled.

The fluency point goes deeper than "the output looks nice." Users read processing ease as a signal of their *own* competence: a fluent answer makes them feel more capable, even though they didn't produce it and don't understand the process (see "Does processing ease mislead users about their own competence?"). This is the uncomfortable implication for feature highlighting: the over-reliance is metacognitive and self-directed, not a misunderstanding of the model's inputs. A highlight that says "these words drove the answer" still arrives wrapped in the same fluency that produced the illusion in the first place, and may even add to the polish that earns misplaced trust.

There's a second wrinkle. Highlighting assumes the model's stated input attributions are faithful to what's actually doing the work, but the corpus repeatedly shows models leaning on signals that have little to do with semantic content. Instruction-tuned models hit nearly identical accuracy on semantically empty or deliberately wrong instructions, suggesting what transfers is output-format knowledge, not task understanding (see "Does instruction tuning teach task understanding or output format?"). If the features a system *appears* to use diverge from what drives its behavior, a highlight could manufacture false confidence rather than calibrate it.

Where the collection does point constructively is toward treating trust calibration as a measurable, multi-channel design problem rather than a single explainability add-on. Prompt and output quality can be scored across dimensions such as Hallucination and Responsibility, grounded in communication theory (see "Can we measure prompt quality independent of model outputs?"). Systems can also read hesitation, gaze, and interaction speed to sense when a user is uncertain, though the same substrate that could time a helpful nudge could also be used to manipulate (see "Can AI systems read cognitive state from interaction patterns alone?"). The thing you might not have expected to learn: the research here suggests over-reliance is a fluency-and-form phenomenon, so the more promising countermeasure is not explaining inputs but disrupting the unearned sense of ease that fluent output produces.


Sources (5 notes)

Does polished AI output trick audiences into trusting it?

Generative AI produces visually sophisticated outputs without underlying judgment, leveraging the historical heuristic that professional-looking work signals expert thinking. This substitution is especially risky for less experienced workers who lack domain knowledge to evaluate substance beyond form.

Does processing ease mislead users about their own competence?

High-quality AI output triggers a metacognitive heuristic: users experience fluency as a signal of their own capability, even though they didn't generate it. This self-directed fluency illusion systematically inflates perceived competence because LLMs optimize for fluency regardless of user understanding.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, scoring 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Can AI systems read cognitive state from interaction patterns alone?

Research shows AI systems can instrument multimodal behavioral signals (gaze, hesitation, speed) to read cognitive state during interaction, preserving flow by avoiding disruptive explicit probes. However, the same substrate enables both helpful timing and manipulative profiling.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an analyst re-testing whether feature highlighting (saliency maps, input attribution) actually reduces human over-reliance on AI outputs. The question remains open: does surfacing which inputs drove a model's answer shift how much users trust it?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and converge on a counterintuitive mechanism:
• Over-trust is driven by *surface fluency* and output polish, not by reasoning about inputs—users misread processing ease as a signal of their own competence (2024–2025 work).
• Instruction-tuned models achieve near-identical accuracy on semantically empty or deliberately wrong instructions, suggesting they learn output-format distribution, not task understanding (arXiv:2305.11383, 2023).
• Highlighting model-stated feature attributions risks *manufacturing* false confidence if those attributions diverge from what actually drives behavior—a polished explanation layered atop fluent output compounds the problem (2026 synthesis).
• Trust calibration works better as a multi-channel design problem (prompt quality, hesitation detection, gaze/interaction speed) than as a single explainability widget (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023): Do Models Really Learn to Follow Instructions?
• arXiv:2504.16021 (2025): Navigating the State of Cognitive Flow—AI Interventions
• arXiv:2603.29025 (2026): Surface Heuristics Override Implicit Constraints
• arXiv:2604.14807 (2026): Misattribution in AI-Assisted Cognitive Workflows

Your task:
(1) **RE-TEST THE FLUENCY HYPOTHESIS.** Judge whether recent models, training methods (RLHF variants, consistency training), or UI/prompt design since mid-2026 have *decoupled* trust from processing ease. Has anything actually made users trust explanations of reasoning over the speed of output? Separate "highlight features" (likely still surface-level) from "break fluency illusion" (the durable problem).
(2) **Surface the strongest work disagreeing with the fluency-over-reasoning claim.** Look for papers arguing that feature attribution, interpretability, or transparency *does* shift reliance behavior, especially if published in the last 6 months.
(3) **Propose two research questions that assume the regime has moved:** (a) If fluency is the lever, how do you design explanations that *disrupt* rather than amplify polish? (b) Can multimodal behavioral sensing (gaze, hesitation) detect over-reliance before it causes harm, and does real-time correction work?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
