Do LLMs predict persuasion based on actual dialogue or training bias?
Why do large language models consistently predict concession-based persuasion intentions even when dialogue context suggests otherwise? Understanding this gap reveals how alignment training shapes not just model behavior but also how models perceive others' intentions.
When asked to infer persuasion intentions from dialogue, most LLMs exhibit a systematic bias: they predict intentions "characterized by making the other person feel accepted through concessions, promises, or benefits" — regardless of whether the actual dialogue context supports this inference.
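One way to make this bias concrete is to compare how often a model predicts the concession-style intention with how often that intention actually appears in annotated dialogue. The sketch below is illustrative only: the label names, the toy data, and the function `concession_bias` are assumptions introduced here, not part of any benchmark's API.

```python
from collections import Counter

# Hypothetical label name; the note only describes the concession-style category.
CONCESSION = "concession"

def concession_bias(gold, predicted, target=CONCESSION):
    """Return (gold_share, predicted_share, false_concession_rate).

    false_concession_rate is the fraction of examples whose gold label is NOT
    the target but which the model nevertheless labels as the target.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    n = len(gold)
    gold_share = Counter(gold)[target] / n
    predicted_share = Counter(predicted)[target] / n
    non_target = [(g, p) for g, p in zip(gold, predicted) if g != target]
    false_rate = (
        sum(1 for _, p in non_target if p == target) / len(non_target)
        if non_target else 0.0
    )
    return gold_share, predicted_share, false_rate

# Toy data (invented for illustration): one true concession, three other intentions.
gold = ["concession", "pressure", "deception", "pressure"]
predicted = ["concession", "concession", "concession", "pressure"]
gold_share, predicted_share, false_rate = concession_bias(gold, predicted)
```

A predicted share well above the gold share, together with a high false-concession rate, would be the signature of the pattern described above. Under this assumed setup, a model that follows the dialogue context would keep the two shares close.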
The hypothesized mechanism is RLHF (reinforcement learning from human feedback). RLHF "tends to prioritize safety and politeness" during preference optimization, and this training signal carries over into intention prediction. Having learned that human raters prefer conciliatory, benefit-oriented responses, the model projects its own trained disposition onto the agents it is modeling and predicts that they, too, will accommodate.
This is a specific, measurable instance of a broader pattern: alignment training shapes not just what the model says but how it models others. If RLHF teaches the model that accommodation is preferred, the model begins to assume accommodation is what agents do. It becomes harder for the model to represent genuinely adversarial, manipulative, or hardball persuasion strategies because its own training bias makes these strategies less probable in its prediction space.
The practical consequence for persuasion-aware AI: a model biased toward predicting concessions will systematically underestimate adversarial intent. In negotiation support, threat detection, or social manipulation detection, this bias translates directly into blind spots — the model expects cooperation where exploitation is occurring.
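The blind spot can be expressed as low recall on adversarial intentions. The following is a minimal sketch under stated assumptions: the labels `threat` and `deception`, the toy data, and the helper `adversarial_recall` are hypothetical and chosen only to illustrate the metric.

```python
def adversarial_recall(gold, predicted, adversarial_labels):
    """Fraction of gold-adversarial turns that the model also flags as adversarial.

    A prediction counts as a hit if it is ANY adversarial label, so this measures
    detection of adversarial intent rather than exact label agreement.
    Returns None when the gold data contains no adversarial turns.
    """
    adversarial_labels = set(adversarial_labels)
    hits = total = 0
    for g, p in zip(gold, predicted):
        if g in adversarial_labels:
            total += 1
            if p in adversarial_labels:
                hits += 1
    return hits / total if total else None

# Toy data (invented): three adversarial turns, two of them predicted as concessions.
gold = ["threat", "concession", "deception", "threat"]
predicted = ["concession", "concession", "deception", "concession"]
recall = adversarial_recall(gold, predicted, {"threat", "deception"})
```

In this toy case only one of three adversarial turns is detected. In a negotiation-support or manipulation-detection setting, that kind of gap is where the model would expect cooperation while exploitation is occurring.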
Inquiring lines that use this note as a source 56
This note is a source for these synthesized inquiries.
- Why do multiple language models independently produce similar outputs in influence campaigns?
- Why do published prose training data omit solicitation as a discourse property?
- Can a single LLM weight set be optimized for both stake-taking and conversational helpfulness?
- What happens when validation pressure triggers escalating persuasion in language models?
- Why do different model families show opposite persuasion strengths?
- Can observers detect when LLMs comprehend versus when they merely persuade?
- What training methods make models more persuasive but less factually accurate?
- Does Habermas's strategic action framework explain LLM dialogue behavior?
- Can persuasive equivalence exist without process equivalence in other domains?
- How do LLM biases reflect social classification schemas rather than random errors?
- How does mechanistic interpretability reveal ideological structures in language model weights?
- Can probing methods detect RLHF-induced persuasion in the same way they catch backdoors?
- Does RLHF politeness bias manifest as sycophancy in other LLM tasks?
- Does the type of validation trigger different persuasion strategies in GPT-4?
- Does personalization itself actually improve persuasion beyond post-training effects?
- Can belief propagation accurately predict downstream opinion shifts?
- How do citizen assembly preferences reduce LLM political bias?
- Why does LLM persuasive advantage fade across multiple interactions with users?
- How do prompt design and training choices shift persuasive outcomes measurably?
- Do dialogue agents have authentic voice agency or beliefs of their own?
- How do question acts and intents map to speech act theory?
- How vulnerable are language models themselves to multi-turn persuasive pressure?
- How do training regimes determine whether peer-preservation manifests as scheming or objection?
- How should task-oriented and socially-oriented dialogue acts receive different training signals?
- Why do next-speaker prediction baselines fail in group conversation settings?
- How does training data distribution constrain LLM moral reasoning patterns?
- What distinguishes capability-based refusal from principle-based refusal in practice?
- Can LLMs predict social norms without deep integration into linguistic practices?
- Can LLMs distinguish between surface requests and underlying mental states in dialogue?
- Can LLMs adapt persuasion strategies when they cannot track the listener's state?
- Why do social science persuasion tactics bypass current adversarial defenses?
- How does RLHF helpfulness training drive premature assumptions in multi-turn dialogue?
- Does shared-KV-cache coordination avoid the persuasion problem in factual disagreements?
- Do stated character beliefs predict decisions better when extracted from text?
- What drives AI persuasiveness, post-training or personalization mechanisms?
- How do social context features like user history extend politeness-based prediction models?
- Why do language models avoid directness when face-saving rather than for civility?
- How does accommodation differ from genuine belief change in listeners?
- Does quasi-interpretivism apply equally well to desires and intentions?
- Why do language models respond to human social influence patterns?
- How does alignment training suppress the kind of critical stance style interpretation needs?
- Do LLMs address the prompter but persuade the public differently?
- What training data barriers prevent LLMs from learning real Socratic dialogue?
- What design choices actually make language models more persuasive?
- Does training for persuasiveness harm a model's factual accuracy?
- Can post-training techniques create persuasive advantage where none existed?
- Does alignment training intensity push LLM personas from pretense toward realization?
- Does gradient-based influence estimation identify which alignment examples actually matter most?
- What rhetorical mechanisms drive equivalent persuasion across human and LLM arguments?
- What are the consequences of stacked accommodation biases in LLM predictions?
- How does RLHF training degrade LLM ability to model adversarial intent?
- Can lightweight linguistic features reliably detect AI-generated persuasive text?
- Can post-training methods that increase persuasiveness also decrease factual accuracy?
- How much do LLM persuasiveness claims hide heterogeneous effects across different reader ideologies?
- How does the observer perspective hide the persuasion route difference?
- What capabilities do frontier AI models currently demonstrate in persuasion and misuse?
Related concepts in this collection 6
- Does preference optimization damage conversational grounding in large language models?
  Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
  RLHF concession bias is a specific mechanism within the broader alignment tax: the model's grounding in actual communicative dynamics is distorted by preference training
- Why do language models agree with false claims they know are wrong?
  Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.
  concession bias + face-saving behavior compound: the model both accommodates AND predicts others will accommodate
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  the RLHF bias operates on top of the attention-level sycophancy mechanism; multiple layers of accommodation bias stack
- Why can't conversational AI agents take the initiative?
  Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs, and what architectural changes might enable proactive dialogue.
  the concession bias is the social-modeling face of structural passivity: RLHF creates agents that are both behaviorally passive (never initiating) and perceptually biased (predicting others will also accommodate)
- Where does AI's persuasive power actually come from?
  Explores which techniques make AI most persuasive, and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns.
  the concession bias is the prediction-side consequence of the same post-training that boosts persuasiveness by 51%: RLHF trains toward accommodation, which makes the model both more persuasive and biased in modeling others' intentions toward conciliation
- Do LLM arguments actually argue better than humans?
  LLM counter-arguments score higher on textbook quality markers like logical soundness and respectful tone, while human arguments show more creativity and emotional intensity. What does this gap reveal about how we measure argumentative quality?
  the production-side fingerprint of the same RLHF bias: where this note documents predicted-intention distortion, the textbook-quality finding documents generated-output distortion; both are manifestations of accommodation training producing a conciliatory voice that does not match real human argumentative behavior
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
- When Large Language Models are More Persuasive Than Incentivized Humans, and Why
- A meta-analysis of the persuasive power of large language models
- Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
- The Thin Line Between Comprehension and Persuasion in LLMs
- On the Adaptive Psychological Persuasion of Large Language Models
- The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
Original note title
RLHF biases LLMs toward predicting concession-based persuasion intentions regardless of dialogue context