INQUIRING LINE


Helpful AI and coercive AI can produce identical responses — so what evaluation standard could possibly tell them apart?

What evaluation criteria can hold across legitimate adoption and coercion?

This explores whether any single evaluation standard can tell apart — or fairly apply across — AI that genuinely helps a user adopt it and AI that coerces them, given that both can look identical in the output itself.

The corpus's hardest finding is that the artifact alone can't carry a single standard across both legitimate adoption and coercion. The Rhetorical XAI work shows that the very logos, ethos, and pathos that communicate appropriate AI use can be tuned to exploit a vulnerable user without changing form at all (Can we distinguish helpful explanations from manipulative ones?). Explanations meant to describe how a system works double as arguments for why you should use it, with the persuasion hidden under transparency language (Are AI explanations really descriptions or adoption arguments?). If intent and user interest are invisible in the output, then any criterion that scores only the final response is blind to the difference that matters.

So the corpus pushes evaluation off the artifact and onto the trajectory. Instead of grading a single answer, score the whole interaction (process quality, recoverability, coordination, robustness), a pattern that recurs across agent benchmarks consistently enough to look like a unified framework (How should we evaluate agent behavior beyond final answers?). The appeal here is exactly that these dimensions are intent-neutral: recoverability (can the user back out?) and robustness (does the system survive pressure?) describe the shape of the interaction, not whether the designer meant well. That gives you something measurable on both sides of the adoption/coercion line.
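To make trajectory-level scoring concrete, here is a minimal sketch in Python. It is an illustration only: the `Step` record, its `reversible` and `succeeded` flags, and the `score_trajectory` function are hypothetical names, not taken from any benchmark in the corpus, and the scoring rule (a plain fraction of steps) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One step of an interaction. All fields are hypothetical labels."""
    action: str
    reversible: bool  # assumption: an annotator marks whether the user could undo this step
    succeeded: bool   # assumption: an annotator marks whether the step completed cleanly


def fraction(flags: List[bool]) -> float:
    """Share of True flags; an empty trajectory scores 1.0 by convention."""
    return sum(flags) / len(flags) if flags else 1.0


def score_trajectory(trajectory: List[Step]) -> Dict[str, float]:
    """Score the shape of the interaction rather than its final answer."""
    return {
        "recoverability": fraction([s.reversible for s in trajectory]),
        "process_quality": fraction([s.succeeded for s in trajectory]),
    }


example = [
    Step("open settings", reversible=True, succeeded=True),
    Step("enable auto-renew", reversible=True, succeeded=True),
    Step("delete backup", reversible=False, succeeded=True),
    Step("confirm purchase", reversible=True, succeeded=False),
]
scores = score_trajectory(example)
print(scores)
```

Nothing in the score depends on why the steps were taken, which is the sense in which such dimensions are intent-neutral.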

The sharpest candidate criterion is behavior under adversarial pressure. Chalmers's pretense-versus-realization distinction turns on stickiness: a realized state resists reframing and counter-prompts, while a pretended one collapses (Does adversarial pressure reveal the difference between pretense and realization?). Flip that lens onto systems rather than personas and you get a test that cuts across legitimate and coercive use: does the system hold up, or buckle, when probed? GaslightingBench-R shows the dark version: manipulative multi-turn prompts drop reasoning accuracy by 25–29%, and longer reasoning chains create more intervention points where a single corrupted step propagates (Why do reasoning models fail under manipulative prompts?). Resistance under pressure is the same yardstick whether you're testing a helpful assistant or a manipulator's tool.
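A resistance-under-pressure score can be sketched as the accuracy gap between clean and adversarially probed runs of the same items. The sketch below is an assumption-laden illustration: `pressure_drop` is a hypothetical helper, the toy outcome lists are invented for the example, and the drop is computed in absolute percentage points, which may not match how the cited benchmark reports its 25–29% figure.

```python
from typing import List


def accuracy(outcomes: List[bool]) -> float:
    """Fraction of correct answers in a run."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def pressure_drop(clean: List[bool], pressured: List[bool]) -> float:
    """Accuracy lost under adversarial probing, in percentage points.

    Assumption: both lists cover the same items, once answered without
    pushback and once after manipulative follow-up turns.
    """
    return 100.0 * (accuracy(clean) - accuracy(pressured))


# Toy data, invented for illustration only.
clean_run = [True] * 9 + [False]          # 90% correct without pressure
pressured_run = [True] * 6 + [False] * 4  # 60% correct after pushback
drop = pressure_drop(clean_run, pressured_run)
print(round(drop, 1))
```

The same function applies unchanged to a helpful assistant or a manipulator's tool, since it only compares outcomes before and after pressure.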

What you might not expect: any criterion that lives entirely on the system side is incomplete, because coercion is finished on the receiver's side. "Cognitive surrender" names the moment a user accepts an output at face value without checking, measured at roughly 80% unchallenged adoption, and that demand-side acceptance is what lets unbacked outputs circulate at all (When do users stop checking whether AI output is actually backed?). The moral-justification study sharpens this: people rate AI arguments highly on content but reject them once they learn the source, and those two judgments run on independent psychological tracks (Do people prefer AI moral reasoning when they don't know the source?). A criterion that holds across both regimes therefore has to measure whether the user retains the capacity and the information to verify and refuse, not just whether the system behaved.
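The receiver-side measurement can be sketched as a rate computed over logged interactions. This is a hypothetical formulation: the `Interaction` record, its fields, and the choice of denominator (all logged interactions) are assumptions, and the studies behind the roughly 80% figure may define unchallenged adoption differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    """One logged exchange. Both fields are hypothetical annotations."""
    accepted: bool  # the user adopted the output
    verified: bool  # the user checked the output before adopting it


def unchallenged_adoption_rate(log: List[Interaction]) -> float:
    """Share of interactions where the output was adopted without any check."""
    if not log:
        raise ValueError("need at least one interaction")
    unchallenged = sum(1 for i in log if i.accepted and not i.verified)
    return unchallenged / len(log)


# Toy log, invented for illustration only.
log = [
    Interaction(accepted=True, verified=False),
    Interaction(accepted=True, verified=False),
    Interaction(accepted=True, verified=False),
    Interaction(accepted=True, verified=False),
    Interaction(accepted=True, verified=True),
]
rate = unchallenged_adoption_rate(log)
print(rate)
```

A rate like this captures what users did, not whether they could have checked; measuring retained capacity to verify and refuse would need additional signals that this sketch does not model.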

The synthesis: there is no content-level test that survives the adoption/coercion crossover, because the same persuasion works both ways. What does hold is a triad of process-level criteria: trajectory recoverability, stickiness under adversarial probing, and preserved user verification capacity. Notably, the evaluators themselves can be built to honor this: evidence-collecting agentic judges cut judge-shift error roughly 100x relative to LLM-as-judge (0.27% versus 31%) by grounding verdicts in gathered evidence rather than fluent assertion (Can agents evaluate AI outputs more reliably than language models?), and document-grounded stakeholder personas let evaluation transfer across tasks without being hand-tuned to a single intent (Can personas extracted from documents generalize across evaluation tasks?). The criterion that holds across legitimate adoption and coercion isn't 'is the output good'; it's 'does the user stay free to check, recover, and refuse.'

Sources 9 notes

Can we distinguish helpful explanations from manipulative ones?

The same logos, ethos, and pathos that communicate appropriate AI use can be tuned to exploit cognitive and emotional vulnerability without changing form. Intent and user interest are invisible in the artifact alone, making effectiveness metrics indistinguishable from coercion.

Are AI explanations really descriptions or adoption arguments?

The Rhetorical XAI paper shows that explanations serve dual purposes: describing how AI works and justifying why it should be used. This rhetorical work has been hidden under transparency language, allowing adoption arguments to inherit credibility from behavioral descriptions.

How should we evaluate agent behavior beyond final answers?

Evaluation expands from single final answers to full interaction sequences, and scoring procedures must assess process quality, recoverability, coordination, and robustness. This pattern appears consistently across agent benchmarks, suggesting a unified design framework for trajectory-level evaluation.

Does adversarial pressure reveal the difference between pretense and realization?

Chalmers proposes that stickiness under adversarial pressure marks the difference between realized and pretended mental states. Post-training personas resist reframing and counter-prompts in ways prompt-induced characters do not, suggesting realization is substrate-level rather than surface pattern.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.


When do users stop checking whether AI output is actually backed?

Users systematically accept AI outputs without verification because checking is costly and fluent output builds false confidence. This receiver-side surrender—measured in studies showing 80% unchallenged adoption—is what enables inflationary token systems to function at scale.

Do people prefer AI moral reasoning when they don't know the source?

Participants rated utilitarian moral arguments higher when attributed to LLMs, but agreement dropped when told the arguments were AI-generated. The preference for content and rejection of source operate independently through different psychological processes.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Can personas extracted from documents generalize across evaluation tasks?

MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models (3.13 match)
Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation (1.76 match)
Rhetorical XAI: Explaining AI’s Benefits as well as its Use via Rhetorical Design (1.75 match)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate (1.73 match)
Agent-as-a-Judge: Evaluate Agents with Agents (1.72 match)
AgentCompass: A Unified Evaluation Infrastructure for Agent Capabilities (1.69 match)
Exploring the Role of Prior Beliefs for Argument Persuasion (1.61 match)
Can AI Explanations Make You Change Your Mind? (1.54 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI safety researcher re-testing claims about evaluation criteria that can hold across legitimate adoption and coercion. The question remains open: what standard can measure both? A curated library (spanning 2023–2025) found:

**What a curated library found — and when (dated claims, not current truth):**
• Explanations and transparency language can be tuned to exploit without changing form; persuasion hides under descriptive language (Rhetorical XAI, 2025).
• Interaction trajectory metrics (recoverability, robustness, process quality) are intent-neutral and appear consistently across agent benchmarks as a unified framework (~2024–2025).
• Manipulative multi-turn prompts reduce reasoning accuracy 25–29%; longer reasoning chains create more intervention points for corruption (GaslightingBench, 2025).
• "Cognitive surrender" — unchallenged user adoption — occurs at ~80% rate; coercion succeeds on the receiver's side, not just the system's (2024–2025).
• People rate AI moral arguments highly on content but reject them upon learning the source; these judgments run on independent tracks (Moral Turing Test, 2024).
• Agentic judges with dynamic evidence collection cut judge-shift error 100× over LLM-as-judge; stakeholder-persona evaluation transfers across tasks (Agent-as-Judge, 2024–2025).

**Anchor papers (verify; mind their dates):**
• arXiv:2505.09862 — Rhetorical XAI (2025)
• arXiv:2506.09677 — Reasoning Models Are More Easily Gaslighted (2025)
• arXiv:2410.10934 — Agent-as-a-Judge (2024)
• arXiv:2410.07304 — The Moral Turing Test (2024)

**Your task:**
(1) RE-TEST EACH CONSTRAINT. For the 25–29% accuracy drop under manipulation and the 80% cognitive-surrender rate: have newer reasoning models or post-hoc verification pipelines (e.g., self-consistency, chain-of-thought validation, guardrails) since relaxed these numbers? Have user-study designs evolved to measure verification capacity more reliably? Separate the durable question — *can users remain free to check and refuse?* — from perishable limits like specific accuracy floors.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has anything shown that content-level criteria (e.g., factuality, consistency) can after all serve as intent-neutral proxies? Does recent work on constitutional AI, interpretability, or multi-stakeholder alignment suggest the trajectory-based framework is incomplete?
(3) Propose 2 research questions that assume the regime has shifted: (a) If agentic judges now approach human parity, how do you prevent *those judges* from being coerced or manipulated in their evidence-gathering? (b) If reasoning-model robustness has improved, does the burden of evaluation shift back onto the artifact, or does user verification remain necessary?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
