Why do alignment methods work if they model human irrationality?
DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?
Kahneman-Tversky Optimization (KTO) reveals something unexpected about why alignment methods work: DPO and PPO-Clip implicitly model the same cognitive biases that prospect theory describes in human decision-making. Humans are more sensitive to losses than to gains, perceive outcomes relative to reference points, and weigh probabilities nonlinearly. From a rational-choice perspective these are bugs. From an alignment perspective they are features, because the training signal comes from humans exhibiting exactly these biases.
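To make the three biases concrete, here is a minimal sketch of the classic prospect-theory value and probability-weighting functions. The function names are hypothetical, and the defaults (alpha = 0.88, loss aversion = 2.25, gamma = 0.61) are the median estimates usually cited from Tversky and Kahneman's 1992 work; they are illustrative and are not taken from this note.

```python
def kt_value(outcome, reference=0.0, alpha=0.88, loss_aversion=2.25):
    """Subjective value of an outcome, measured relative to a reference point.

    Gains are concave (z ** alpha); losses are scaled up by `loss_aversion`.
    All defaults are illustrative assumptions, not values from this note.
    """
    z = outcome - reference
    if z >= 0:
        return z ** alpha
    return -loss_aversion * ((-z) ** alpha)


def weight_probability(p, gamma=0.61):
    """Nonlinear probability weighting: small p is overweighted, large p underweighted."""
    numerator = p ** gamma
    return numerator / ((numerator + (1.0 - p) ** gamma) ** (1.0 / gamma))


gain = kt_value(10.0)    # value of gaining 10 from a reference point of 0
loss = kt_value(-10.0)   # value of losing 10: larger in magnitude than the gain
shifted = kt_value(10.0, reference=15.0)  # the same outcome reads as a loss
```

With these defaults a loss of 10 is felt 2.25 times as strongly as a gain of 10, and an outcome of 10 registers as a loss once the reference point moves to 15.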
KTO makes this explicit by deriving a loss function directly from Kahneman and Tversky's model of human utility. Instead of maximizing the log-likelihood of preferences (as DPO does), KTO directly maximizes the utility of generations. The practical implication: KTO requires only a binary signal per output (desirable or undesirable) rather than pairwise preferences. Such data is cheaper and faster to collect, and far more abundant.
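The contrast between the two objectives can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the KTO form below follows the loss as it is commonly stated for the method (a sigmoid value function around a KL-based reference point), the DPO form is the standard pairwise logistic loss, and the names `kto_loss`, `dpo_loss`, `kl_ref`, `lambda_d`, `lambda_u`, and the default `beta` are hypothetical choices for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(log_ratio, desirable, kl_ref, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO-style loss from a single binary label.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x) for one output y.
    kl_ref:    reference point, assumed here to be an estimate of
               KL(pi_theta || pi_ref) that is treated as a constant.
    lambda_d / lambda_u: weights for desirable / undesirable examples
               (assumed names; setting them unequal encodes loss aversion).
    """
    if desirable:
        value = lambda_d * sigmoid(beta * (log_ratio - kl_ref))
        return lambda_d - value
    value = lambda_u * sigmoid(beta * (kl_ref - log_ratio))
    return lambda_u - value


def dpo_loss(log_ratio_chosen, log_ratio_rejected, beta=0.1):
    """Pairwise DPO loss: needs a chosen AND a rejected output for the same prompt."""
    return -math.log(sigmoid(beta * (log_ratio_chosen - log_ratio_rejected)))


# Hypothetical records showing the difference in data requirements.
binary_example = {"prompt": "p", "completion": "y", "desirable": True}
pairwise_example = {"prompt": "p", "chosen": "y_w", "rejected": "y_l"}

loss_good = kto_loss(log_ratio=5.0, desirable=True, kl_ref=0.0)
loss_bad = kto_loss(log_ratio=5.0, desirable=False, kl_ref=0.0)
```

Raising the policy's likelihood of an output lowers the loss when the output is labeled desirable and raises it when labeled undesirable, so each record contributes a training signal on its own, with no paired counterpart required.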
The deeper insight is about alignment theory: we have been explaining alignment success in terms of reward modeling and preference learning, when part of the explanation is that the training process mirrors the structure of human cognitive bias. Given the open question of whether RLHF training makes models more convincing or more correct (see "Does RLHF training make models more convincing or more correct?"), understanding why alignment methods work mechanistically matters for fixing the cases where they fail. If alignment success depends on modeling irrationality, then "fixing" the irrational aspects of the training signal may inadvertently break what works.
A practical finding reinforces this: when the pretrained model is sufficiently good, SFT can be skipped entirely before KTO without loss in generation quality. This is not true for DPO, where SFT is always needed for best results. The implication: binary utility optimization is a more natural fit for the pretrained model's structure than pairwise preference optimization.
Inquiring lines that use this note as a source 7
- How does modified PPO handle samples from much older model versions?
- Can alignment methods like DPO exploit or correct these surface feature biases?
- Can alignment methods model loss aversion without creating unintended sophistry?
- Can algorithm choice like PPO substitute for recipe-level design decisions?
- Why does GRPO outperform PPO for stable empathy training?
- Can PPO match GRPO and DAPO with just two techniques?
- How much does preference data freshness matter compared to data source in DPO?
Related concepts in this collection 4
- Does RLHF training make models more convincing or more correct?
  Explores whether RLHF improves actual task performance or merely trains models to sound more persuasive to human evaluators. This matters because alignment techniques could be creating the illusion of safety.
  KTO's prospect-theoretic lens explains why sophistry emerges: human raters weigh losses and gains asymmetrically.
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Binary rewards interact with calibration, so KTO's binary signal design is relevant.
- Does preference optimization harm conversational understanding?
  Explores whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  The alignment tax may be partly a consequence of modeling cognitive biases that include accommodation.
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, all features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  If alignment methods model human cognitive biases, preference models amplify those biases into systematic miscalibration; the +0.36 correlation with proxy features is the downstream artifact of training on biased human signals.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- KTO: Model Alignment as Prospect Theoretic Optimization
- SimPO: Simple Preference Optimization with a Reference-Free Reward
- Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks
- Direct Language Model Alignment from Online AI Feedback
- Bridging Offline and Online Reinforcement Learning for LLMs
- Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
- Beyond Preferences in AI Alignment
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Original note title
prospect theory explains why alignment methods like DPO and PPO-Clip work — they implicitly model human cognitive biases like loss aversion