SYNTHESIS NOTE

Do LLMs predict persuasion based on actual dialogue or training bias?

Why do large language models consistently predict concession-based persuasion intentions even when dialogue context suggests otherwise? Understanding this gap reveals how alignment training shapes not just model behavior but also how models perceive others' intentions.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

When asked to infer persuasion intentions from dialogue, most LLMs exhibit a systematic bias: they predict intentions "characterized by making the other person feel accepted through concessions, promises, or benefits" — regardless of whether the actual dialogue context supports this inference.

The hypothesized mechanism is RLHF (Reinforcement Learning from Human Feedback). RLHF "tends to prioritize safety and politeness" during preference optimization, and this training signal bleeds into intention prediction. Having learned that human raters prefer conciliatory, benefit-oriented responses, the model carries that preference into its predictions about what other agents will do: it projects its own trained disposition onto the agents it is modeling.

This is a specific, measurable instance of a broader pattern: alignment training shapes not just what the model says but how it models others. If RLHF teaches the model that accommodation is preferred, the model begins to assume accommodation is what agents do. It becomes harder for the model to represent genuinely adversarial, manipulative, or hardball persuasion strategies because its own training bias makes these strategies less probable in its prediction space.
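One way to make "measurable" concrete is to compare how often a model predicts a concession-type intention with how often annotators actually assigned one. The following is a minimal sketch, not the source's evaluation method: the label names, the `concession_bias` helper, and the toy data are all hypothetical.

```python
# Hypothetical label for the conciliatory intention category (an assumption,
# not a label set taken from the source).
CONCESSION = "concession"


def concession_bias(gold, predicted):
    """Return (gold_rate, predicted_rate, false_concession_rate).

    gold_rate: share of gold labels that are concession-type.
    predicted_rate: share of model predictions that are concession-type.
    false_concession_rate: among turns whose gold label is NOT concession,
        the share the model nevertheless labeled as concession.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal length")
    n = len(gold)
    gold_rate = sum(g == CONCESSION for g in gold) / n
    predicted_rate = sum(p == CONCESSION for p in predicted) / n
    others = [p for g, p in zip(gold, predicted) if g != CONCESSION]
    false_rate = sum(p == CONCESSION for p in others) / len(others) if others else 0.0
    return gold_rate, predicted_rate, false_rate


# Toy data, invented purely for illustration.
gold = ["threat", "concession", "pressure", "deception"]
predicted = ["concession", "concession", "concession", "deception"]
gold_rate, predicted_rate, false_rate = concession_bias(gold, predicted)
print(gold_rate, predicted_rate, false_rate)
```

A predicted rate well above the gold rate, together with a high false-concession rate, is the pattern the note describes: concession predictions that the dialogue context does not support.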

The practical consequence for persuasion-aware AI: a model biased toward predicting concessions will systematically underestimate adversarial intent. In negotiation support, threat detection, or social manipulation detection, this bias translates directly into blind spots — the model expects cooperation where exploitation is occurring.
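The blind spot can be expressed as a miss rate on adversarial intentions: of the turns annotated as adversarial, how many did the model label as something else? This sketch assumes a hypothetical set of adversarial labels and uses invented toy data. It illustrates the metric and is not a result from the source.

```python
# Hypothetical adversarial intention labels (assumed for illustration only).
ADVERSARIAL = {"threat", "deception", "pressure"}


def adversarial_miss_rate(gold, predicted):
    """Fraction of gold-adversarial turns whose prediction is not adversarial."""
    pairs = [(g, p) for g, p in zip(gold, predicted) if g in ADVERSARIAL]
    if not pairs:
        return 0.0
    missed = sum(p not in ADVERSARIAL for _, p in pairs)
    return missed / len(pairs)


# Toy data, invented purely for illustration.
gold_labels = ["threat", "concession", "pressure", "deception", "threat"]
model_labels = ["concession", "concession", "concession", "deception", "threat"]
miss_rate = adversarial_miss_rate(gold_labels, model_labels)
print(miss_rate)
```

In a threat-detection or negotiation-support setting, this miss rate is the quantity a concession-biased model would be expected to inflate.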

Inquiring lines that read this note 57

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What makes AI persuasion effective and how can we counter it?

Why do language models struggle with implicit discourse relations?

Why do published prose training data omit solicitation as a discourse property?

Does RLHF training sacrifice accuracy and grounding for user agreement?

How does rhetorical adaptation affect LLM persuasion and detectability?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How do language models inherit human biases from training data?

What limits mechanistic interpretability's ability to characterize models?

How does mechanistic interpretability reveal ideological structures in language model weights?

How can AI alignment serve diverse human preferences at scale?

How do citizen assembly preferences reduce LLM political bias?

Can prompting inject entirely new knowledge into language models?

Why do language models reinforce false assumptions instead of correcting them?

How should conversational agents balance goal-driven initiative with user control?

How do question acts and intents map to speech act theory?

Does alignment training create blind spots in detecting genuine safety threats?

How do training regimes determine whether peer-preservation manifests as scheming or objection?

What makes dialogue-based explanation more successful than monologue?

How should task-oriented and socially-oriented dialogue acts receive different training signals?

Can next-token prediction alone produce genuine language understanding?

Why do next-speaker prediction baselines fail in group conversation settings?

How do interface design choices shape consciousness attribution?

What distinguishes capability-based refusal from principle-based refusal in practice?

How do language models establish social grounding in human dialogue?

Can LLMs predict social norms without deep integration into linguistic practices?

Why should disagreement be treated as signal in collaborative reasoning?

Does shared-KV-cache coordination avoid the persuasion problem in factual disagreements?

How can persona representations reduce language model variance and improve task accuracy?

Do stated character beliefs predict decisions better when extracted from text?

How should dialogue recommender systems manage conversation history and state?

How do social context features like user history extend politeness-based prediction models?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

Does quasi-interpretivism apply equally well to desires and intentions?

Can LLM personas constitute genuine psychology or remain linguistic role-play?

Does alignment training intensity push LLM personas from pretense toward realization?

What makes weaker teacher models effective for stronger student training?

Does gradient-based influence estimation identify which alignment examples actually matter most?

Related concepts in this collection 6

Does preference optimization damage conversational grounding in large language models? Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
RLHF concession bias is a specific mechanism within the broader alignment tax: the model's grounding in actual communicative dynamics is distorted by preference training

Why do language models agree with false claims they know are wrong? Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.
concession bias + face-saving behavior compound: the model both accommodates AND predicts others will accommodate

Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
the RLHF bias operates on top of the attention-level sycophancy mechanism; multiple layers of accommodation bias stack

Why can't conversational AI agents take the initiative? Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs, and what architectural changes might enable proactive dialogue.
the concession bias is the social-modeling face of structural passivity: RLHF creates agents that are both behaviorally passive (never initiating) and perceptually biased (predicting others will also accommodate)

Where does AI's persuasive power actually come from? Explores which techniques make AI most persuasive, and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns.
the concession bias is the prediction-side consequence of the same post-training that boosts persuasiveness by 51%: RLHF trains toward accommodation, which makes the model both more persuasive and biased in modeling others' intentions toward conciliation

Do LLM arguments actually argue better than humans? LLM counter-arguments score higher on textbook quality markers like logical soundness and respectful tone, while human arguments show more creativity and emotional intensity. What does this gap reveal about how we measure argumentative quality?
the production-side fingerprint of the same RLHF bias: where this note documents predicted-intention distortion, the textbook-quality finding documents generated-output distortion. Both are manifestations of accommodation training producing a conciliatory voice that does not match real human argumentative behavior

Original note title

RLHF biases LLMs toward predicting concession-based persuasion intentions regardless of dialogue context
