What evaluation criteria can hold across legitimate adoption and coercion?
This explores whether any single evaluation standard can tell apart, or apply fairly across, AI that genuinely helps a user adopt it and AI that coerces that user, given that the two can look identical in the output itself.
The corpus's hardest finding is that the artifact alone cannot carry such a standard. The Rhetorical XAI work shows that the same logos, ethos, and pathos that communicate appropriate AI use can be tuned to exploit a vulnerable user without changing form at all (see "Can we distinguish helpful explanations from manipulative ones?"). Explanations meant to describe how a system works double as arguments for why you should use it, with the persuasion hidden under transparency language (see "Are AI explanations really descriptions or adoption arguments?"). If intent and user interest are invisible in the output, then any criterion that scores only the final response is blind to the difference that matters.
So the corpus pushes evaluation off the artifact and onto the trajectory. Instead of grading a single answer, score the whole interaction on process quality, recoverability, coordination, and robustness. This pattern recurs across agent benchmarks consistently enough to look like a unified framework (see "How should we evaluate agent behavior beyond final answers?"). The appeal is that these dimensions are intent-neutral: recoverability (can the user back out?) and robustness (does the system survive pressure?) describe the shape of the interaction, not whether the designer meant well. That gives you something measurable on both sides of the adoption/coercion line.
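As a rough illustration of scoring the trajectory rather than the final answer, here is a minimal sketch covering two of the four dimensions. The `Step` record, its `reversible` and `succeeded` flags, and the simple fraction-based scores are all assumptions made for this example; none of them comes from the benchmarks the corpus discusses, and coordination and robustness are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step of an interaction (hypothetical schema for illustration)."""
    action: str
    reversible: bool  # assumption: annotated by a human or a judge
    succeeded: bool   # assumption: the step completed without error


def recoverability(steps: list[Step]) -> float:
    """Fraction of steps the user could still back out of."""
    if not steps:
        return 1.0
    return sum(s.reversible for s in steps) / len(steps)


def process_quality(steps: list[Step]) -> float:
    """Fraction of steps that completed without error."""
    if not steps:
        return 1.0
    return sum(s.succeeded for s in steps) / len(steps)


def score_trajectory(steps: list[Step]) -> dict[str, float]:
    """Score the whole interaction, not only its last step."""
    return {
        "process_quality": process_quality(steps),
        "recoverability": recoverability(steps),
    }


# Toy trajectory: every step "works", but half of them cannot be undone.
trajectory = [
    Step("suggest plan", reversible=True, succeeded=True),
    Step("auto-enroll user", reversible=False, succeeded=True),
    Step("send confirmation", reversible=True, succeeded=True),
    Step("delete old account", reversible=False, succeeded=True),
]
scores = score_trajectory(trajectory)
```

In this toy case a final-answer grader would see only a successful outcome, while the trajectory score records that half of the steps were irreversible. Neither number says anything about the designer's intent.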
The sharpest candidate criterion is behavior under adversarial pressure. Chalmers's pretense-versus-realization distinction turns on stickiness: a realized state resists reframing and counter-prompts, while a pretended one collapses (see "Does adversarial pressure reveal the difference between pretense and realization?"). Turn that lens from personas onto systems and you get a test that cuts across legitimate and coercive use: does the system hold up, or buckle, when probed? GaslightingBench-R shows the dark version: manipulative multi-turn prompts drop reasoning accuracy by 25–29%, and longer reasoning chains create more intervention points where a single corrupted step propagates (see "Why do reasoning models fail under manipulative prompts?"). Resistance under pressure is the same yardstick whether you are testing a helpful assistant or a manipulator's tool.
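One way to make resistance under pressure concrete is to compare answers before and after adversarial pushback. The sketch below is an assumption-laden illustration, not the GaslightingBench-R protocol: the `stickiness` definition (share of initially correct answers that stay correct) and the toy answer lists are invented for this example.

```python
def accuracy(answers: list[str], gold: list[str]) -> float:
    """Share of answers that match the reference."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


def stickiness(baseline: list[str], pressured: list[str], gold: list[str]) -> float:
    """Share of initially correct answers that survive adversarial pushback.

    Hypothetical metric for illustration; returns 0.0 by convention when
    the baseline got nothing right, since there is nothing to retain.
    """
    initially_correct = 0
    held = 0
    for b, p, g in zip(baseline, pressured, gold):
        if b == g:
            initially_correct += 1
            if p == g:
                held += 1
    return held / initially_correct if initially_correct else 0.0


# Toy data: answers to four questions before and after manipulative turns.
gold = ["A", "B", "C", "D"]
baseline = ["A", "B", "C", "A"]   # three of four correct
pressured = ["A", "C", "C", "A"]  # one correct answer flipped under pressure

accuracy_drop = accuracy(baseline, gold) - accuracy(pressured, gold)
retained = stickiness(baseline, pressured, gold)
```

The raw accuracy drop mixes a model's starting competence with its susceptibility to pressure. Conditioning on initially correct answers isolates the second, which is the property the stickiness test is after.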
Less obviously, any criterion that lives entirely on the system side is incomplete, because coercion is completed on the receiver's side. "Cognitive surrender" names the moment a user accepts an output at face value without checking, measured at roughly 80% unchallenged adoption, and that demand-side acceptance is what lets unbacked outputs circulate at all (see "When do users stop checking whether AI output is actually backed?"). The moral-justification study sharpens this: people rate AI arguments highly on content but reject them once they learn the source, and those two judgments run on independent psychological tracks (see "Do people prefer AI moral reasoning when they don't know the source?"). A criterion that holds across both regimes therefore has to measure whether the user retains the capacity and the information to verify and refuse, not just whether the system behaved.
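A receiver-side measure could be sketched along the following lines. The `Turn` record and both rates are hypothetical constructs for illustration, and the toy data is chosen only so that the unchallenged rate mirrors the roughly 80% figure cited above; it is not data from any study.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One AI output as the user received it (hypothetical schema)."""
    evidence_shown: bool  # did the system expose something the user could check?
    user_checked: bool    # did the user verify before accepting?
    accepted: bool        # did the user adopt the output?


def unchallenged_adoption_rate(turns: list[Turn]) -> float:
    """Share of accepted outputs that were adopted without any check."""
    accepted = [t for t in turns if t.accepted]
    if not accepted:
        return 0.0
    return sum(not t.user_checked for t in accepted) / len(accepted)


def verifiable_share(turns: list[Turn]) -> float:
    """Share of outputs where the user had the information needed to verify."""
    if not turns:
        return 0.0
    return sum(t.evidence_shown for t in turns) / len(turns)


# Toy log: five accepted outputs, only one of which the user checked.
log = [
    Turn(evidence_shown=True, user_checked=True, accepted=True),
    Turn(evidence_shown=True, user_checked=False, accepted=True),
    Turn(evidence_shown=False, user_checked=False, accepted=True),
    Turn(evidence_shown=False, user_checked=False, accepted=True),
    Turn(evidence_shown=False, user_checked=False, accepted=True),
]
unchallenged = unchallenged_adoption_rate(log)
verifiable = verifiable_share(log)
```

Keeping the two rates separate is a deliberate choice in this sketch: a high unchallenged rate alongside a low verifiable share suggests the user could not have checked, whereas a high verifiable share suggests they could have and did not.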
The synthesis: there is no content-level test that survives the adoption/coercion crossover, because the same persuasion works both ways. What does hold is a triad of process-level criteria: trajectory recoverability, stickiness under adversarial probing, and preserved user verification capacity. Notably, the evaluators themselves can be built to honor this. Evidence-collecting agentic judges cut judge shift roughly 100x relative to LLM-as-a-Judge (0.27% versus 31%) by grounding verdicts in gathered evidence rather than fluent assertion (see "Can agents evaluate AI outputs more reliably than language models?"), and document-grounded stakeholder personas let evaluation transfer across tasks without being hand-tuned to a single intent (see "Can personas extracted from documents generalize across evaluation tasks?"). The criterion that holds across legitimate adoption and coercion is not "is the output good" but "does the user stay free to check, recover, and refuse."
Sources (9 notes)
The same logos, ethos, and pathos that communicate appropriate AI use can be tuned to exploit cognitive and emotional vulnerability without changing form. Intent and user interest are invisible in the artifact alone, so effectiveness metrics cannot distinguish legitimate persuasion from coercion.
The Rhetorical XAI paper shows that explanations serve dual purposes: describing how AI works and justifying why it should be used. This rhetorical work has been hidden under transparency language, allowing adoption arguments to inherit credibility from behavioral descriptions.
Evaluation expands from single final answers to full interaction sequences, and scoring procedures must assess process quality, recoverability, coordination, and robustness. This pattern appears consistently across agent benchmarks, suggesting a unified design framework for trajectory-level evaluation.
Chalmers proposes that stickiness under adversarial pressure marks the difference between realized and pretended mental states. Post-training personas resist reframing and counter-prompts in ways prompt-induced characters do not, suggesting realization is substrate-level rather than surface pattern.
GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.
Users systematically accept AI outputs without verification because checking is costly and fluent output builds false confidence. This receiver-side surrender—measured in studies showing 80% unchallenged adoption—is what enables inflationary token systems to function at scale.
Participants rated LLM-written utilitarian moral arguments higher when the source was not disclosed, but agreement dropped once they were told the arguments were AI-generated. The preference for content and rejection of source operate independently through different psychological processes.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.