Does highlighting input features reduce human over-reliance on machine outputs?
This reads the question as asking whether explainability techniques that surface which inputs drove a model's output (saliency, feature highlighting) actually make people trust machine outputs less when they should. The corpus has no paper that tests that intervention head-on, but it has a sharp account of why over-reliance happens that bears directly on whether highlighting would work.
This explores whether showing people which input features shaped a machine's answer reduces over-reliance. The honest starting point: nothing in this collection runs that experiment directly — there's no saliency-map or feature-attribution study here. But the corpus is unusually clear about the *mechanism* of over-reliance, and that mechanism predicts highlighting may be aiming at the wrong target.
The core finding is that trust in AI output is driven by surface form, not by reasoning about inputs. Polished, professional-looking output gets believed because people carry a lifelong heuristic that good-looking work signals expert thinking. Generative systems produce the polish without the judgment, which is most dangerous for people who lack the domain knowledge to check substance against form (see "Does polished AI output trick audiences into trusting it?"). If over-trust is a response to fluency rather than to evidence, then annotating *which features mattered* may not touch the lever that's actually being pulled.
The fluency point goes deeper than "the output looks nice." Users read processing ease as a signal of their *own* competence: a fluent answer makes them feel more capable, even though they didn't produce it and don't understand the process (see "Does processing ease mislead users about their own competence?"). This is the uncomfortable implication for feature highlighting: the over-reliance is metacognitive and self-directed, not a misunderstanding of the model's inputs. A highlight that says "these words drove the answer" still arrives wrapped in the same fluency that produced the illusion in the first place, and may even add to the polish that earns misplaced trust.
There's a second wrinkle. Highlighting assumes the model's stated input attributions are faithful to what's actually doing the work, but the corpus shows models leaning on signals that have little to do with semantic content. Models instruction-tuned on semantically empty or deliberately wrong instructions reach nearly the same accuracy as models tuned on correct ones, suggesting that what transfers is output-format knowledge, not task understanding (see "Does instruction tuning teach task understanding or output format?"). If the features a system *appears* to use diverge from what drives its behavior, a highlight could manufacture false confidence rather than calibrate it.
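One way to make the faithfulness worry concrete is a deletion test: remove the highlighted tokens, remove the unhighlighted ones, and compare how much the model's score moves in each case. The sketch below is an illustration only, not a method taken from the corpus. The scorer `toy_score`, the example sentence, and the helper names are all hypothetical, and the toy model is deliberately built to depend on a trailing question mark rather than on the content word a highlight might point to.

```python
from statistics import mean
from typing import Callable, Sequence, Set

Scorer = Callable[[Sequence[str]], float]


def toy_score(tokens: Sequence[str]) -> float:
    """Hypothetical model: its score is driven almost entirely by a trailing '?'."""
    base = 0.9 if tokens and tokens[-1] == "?" else 0.1
    bonus = 0.05 if "refund" in tokens else 0.0  # the content word barely matters
    return base + bonus


def deletion_effect(score: Scorer, tokens: Sequence[str], index: int) -> float:
    """Absolute change in score when the single token at `index` is deleted."""
    kept = [t for i, t in enumerate(tokens) if i != index]
    return abs(score(tokens) - score(kept))


def highlight_is_plausible(score: Scorer, tokens: Sequence[str], highlighted: Set[int]) -> bool:
    """True if deleting highlighted tokens moves the score more, on average,
    than deleting unhighlighted ones. A crude check, assumed for illustration."""
    others = set(range(len(tokens))) - highlighted
    if not highlighted or not others:
        raise ValueError("need at least one highlighted and one unhighlighted token")
    hi = mean([deletion_effect(score, tokens, i) for i in highlighted])
    lo = mean([deletion_effect(score, tokens, i) for i in others])
    return hi > lo


tokens = ["can", "i", "get", "a", "refund", "?"]
print(highlight_is_plausible(toy_score, tokens, {4}))  # highlight on "refund"
print(highlight_is_plausible(toy_score, tokens, {5}))  # highlight on "?"
```

In this toy setup the highlight on "refund" fails the check and the highlight on "?" passes, which is the divergence described above: the feature that looks meaningful to a reader is not the one moving the output. Real attribution methods and real models would need a more careful test than single-token deletion.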
Where the collection does point constructively is toward treating trust calibration as a measurable, multi-channel design problem rather than a single explainability add-on. Prompt quality can be scored across dimensions like Hallucination and Responsibility, grounded in communication theory (see "Can we measure prompt quality independent of model outputs?"). Systems can also read hesitation, gaze, and interaction speed to sense when a user is uncertain, though the same substrate that could time a helpful nudge could also be used to manipulate (see "Can AI systems read cognitive state from interaction patterns alone?"). The thing you might not have expected to learn: the research here suggests over-reliance is a fluency-and-form phenomenon, so the more promising countermeasure isn't explaining inputs but disrupting the unearned sense of ease that fluent output produces.
Sources (5 notes)
Generative AI produces visually sophisticated outputs without underlying judgment, leveraging the historical heuristic that professional-looking work signals expert thinking. This substitution is especially risky for less experienced workers who lack domain knowledge to evaluate substance beyond form.
High-quality AI output triggers a metacognitive heuristic: users experience fluency as a signal of their own capability, even though they didn't generate it. This self-directed fluency illusion systematically inflates perceived competence because LLMs optimize for fluency regardless of user understanding.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
Research shows AI systems can instrument multimodal behavioral signals (gaze, hesitation, speed) to read cognitive state during interaction, preserving flow by avoiding disruptive explicit probes. However, the same substrate enables both helpful timing and manipulative profiling.