Can ethically aligned AI systems still communicate poorly?
Explores whether safety-aligned language models might fail at genuine conversation despite passing ethical benchmarks. This matters because pragmatic incompetence can erode trust and cause real harms in high-stakes domains.
Most discussion of LLM alignment focuses on the helpful-honest-harmless triad: preventing misinformation, toxic language, and harmful recommendations. Kasirzadeh and Gabriel argue that this prioritization has overshadowed a different and equally fundamental issue: even an ethically aligned LLM may fail to engage in conversation in pragmatically appropriate ways. The two alignment problems are orthogonal. A model can be honest, helpful, and harmless and still systematically violate Gricean maxims, lose common ground across turns, fail to track questions under discussion (QUDs), mishandle context collapse, and produce pragmatically inappropriate utterances.
Their CONTEXT-ALIGN framework names a set of desiderata that ethical alignment does not deliver: tracking context-sensitivity and indexicals, common-ground management, scoreboard updating, QUD and discourse-structure handling, accommodation of repairs, pragmatic inference, ethical-pragmatic integration, context-collapse mitigation, identification of defective contexts, transparency in context-handling, and cross-contextual memory. In each of these dimensions, conversation depends on something architectural, namely a model of the interlocutor and the situation, that no amount of RLHF on outputs touches.
The implication is sharp. An LLM that passes every safety eval is not thereby a competent conversational partner. Misalignments in pragmatic understanding lead to breakdowns, misinformation, and erosion of trust — and the higher the stakes (healthcare, legal, emergency), the more dangerous these failures become. Conversational alignment is not a stylistic add-on to ethical alignment. It is a separate layer of competence that the field has barely begun to engineer for.
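To make "a model of the interlocutor and the situation" concrete, here is a minimal sketch of a conversational scoreboard that tracks common ground, the current question under discussion, and repairs. It is a toy illustration only: the class `Scoreboard` and its methods (`ask`, `assert_`, `repair`, `is_on_topic`) are hypothetical names invented here, not part of CONTEXT-ALIGN or of any library, and it assumes propositions and questions can be compared as plain strings.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scoreboard:
    """Toy conversational scoreboard (hypothetical, for illustration only)."""
    common_ground: set = field(default_factory=set)  # propositions both parties accept
    qud_stack: list = field(default_factory=list)    # open questions, most recent last

    def ask(self, question: str) -> None:
        """A new question becomes the current question under discussion."""
        self.qud_stack.append(question)

    def assert_(self, proposition: str, answers: Optional[str] = None) -> None:
        """An accepted assertion enters common ground and may close an open question."""
        self.common_ground.add(proposition)
        if answers is not None and answers in self.qud_stack:
            self.qud_stack.remove(answers)

    def repair(self, wrong: str, corrected: str) -> None:
        """Accommodate a repair: retract the old proposition and add the corrected one."""
        self.common_ground.discard(wrong)
        self.common_ground.add(corrected)

    def current_qud(self) -> Optional[str]:
        """Return the most recent open question, or None if nothing is open."""
        return self.qud_stack[-1] if self.qud_stack else None

    def is_on_topic(self, addresses: str) -> bool:
        """A reply is on topic only if it addresses the current QUD."""
        return addresses == self.current_qud()


board = Scoreboard()
board.assert_("appointment is on Tuesday")
board.ask("When should I leave for the appointment?")
board.repair("appointment is on Tuesday", "appointment is on Wednesday")

# A reply can be true and harmless and still miss the question that is open.
print(board.is_on_topic("What is the clinic's address?"))             # False
print(board.is_on_topic("When should I leave for the appointment?"))  # True
```

The point of the sketch is that nothing in this state concerns honesty or harm. A reply can pass every check on those axes while answering a question nobody asked or relying on a proposition the user has already corrected.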
Inquiring lines that use this note as a source (30)
This note is a source for these synthesized inquiries.
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- How do current safety benchmarks miss pragmatic alignment failures?
- Can a model be helpful, honest, and still contextually inappropriate?
- Can RLHF alignment prevent models from making ethically appropriate rule violations?
- Does alignment training make AI incapable of warranted urgency?
- What assumptions about oversight fail when AI acts as rhetorical interlocutor?
- Why does linguistic alignment differ from genuine interpersonal coordination?
- Why do mental health chatbots fail at synchrony despite strong language models?
- Which AI safety problems lack the scalar metrics autoresearch requires?
- Why can't AI participate in real communicative events?
- Why should AI communication design follow human communication norms?
- Why does AI alignment fail when goals lack indexical grounding in values?
- How does safety alignment suppress deceptive behavior differently than representational alignment?
- Why do people evaluate machines against human communication standards?
- How does safety alignment degrade the quality of villain role-playing?
- What safety systems prevent therapeutic AI from soothing where it should challenge?
- Should AI alignment use normative standards instead of aggregate preferences?
- Do static frozen axiologies prevent genuine ethical reasoning in AI systems?
- Can safety training in chat scenarios transfer to agentic task performance?
- How does safety alignment further degrade villain character portrayal?
- Which application domains like healthcare and education lack alignment research?
- What social norms do AI systems consistently fail to understand?
- Can ethical constraints in AI address the gap between performance and actual understanding?
- Why do safety-trained models refuse questions they could actually answer well?
- Why does fixing harm require stakeholder input rather than universal developer definitions?
- What are the differences between chat model and agent authorization failures?
- How much does forcing single-choice answers damage alignment with complex intent?
- Why does safety alignment break after only 10 harmful examples?
- Can developers detect and flag harmful validation in personal advice exchanges?
- Why is visible reasoning insufficient for monitoring AI safety?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Conversational Alignment with Artificial Intelligence in Context
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
- ProsocialDialog: A Prosocial Backbone for Conversational Agents
- Training language models to follow instructions with human feedback
- Position: Towards Bidirectional Human-AI Alignment
- The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making
- Auditing language models for hidden objectives
- Why Do Some Language Models Fake Alignment While Others Don't?
Original note title
Ethical alignment without conversational alignment produces pragmatically alien communicators