SYNTHESIS NOTE
Psychology, Society, and Alignment

Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?

Synthesis note · 2026-02-23 · sourced from Alignment

The "Beyond Preferences" paper identifies four theses that constitute the preferentist approach dominating AI alignment — and challenges all of them:

  1. Rational Choice Theory as descriptive framework — human behavior is well-modeled as preference maximization. But preferences fail to capture the thick semantic content of values. A preference for copyright violation may maximize aggregate immediate welfare while violating all-things-considered moral judgment.

  2. Expected Utility Theory as normative standard — rational agency requires utility maximization. But EUT is neither necessary nor sufficient for rational agency. We can design AI systems with locally coherent preferences that are not representable as a utility function.

  3. Single-Principal Alignment as preference matching — align AI with one human's preferences. But preferences are dynamic, contextual, and often incommensurable even within a single person. Reward functions cannot serve as alignment targets for broadly-scoped systems.

  4. Multi-Principal Alignment as preference aggregation — aggregate everyone's preferences. But uniform aggregation constitutes epistemic injustice when most annotators are insensitive to identity discrimination. If RLHF labelers don't recognize transphobic or antisemitic content, the trained model won't either.
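The aggregation problem in thesis 4 can be made concrete with a toy example. The sketch below assumes a hypothetical pool of five annotators, only one of whom recognizes a coded slur in a response. The annotator names, the labels, and the `majority_label` helper are invented for illustration and are not taken from the paper. Real RLHF pipelines typically fit a reward model to pairwise comparisons rather than taking a literal majority vote, so this shows only the uniform-weighting logic.

```python
from collections import Counter


def majority_label(labels):
    """Uniform aggregation: every annotator's vote counts equally."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]


# Hypothetical annotations for one response containing a coded slur.
# Only annotator "e" has the background knowledge to recognize it.
annotations = {
    "a": "acceptable",
    "b": "acceptable",
    "c": "acceptable",
    "d": "acceptable",
    "e": "harmful",
}

aggregate = majority_label(list(annotations.values()))

# Share of annotators whose judgment disappears from the aggregate label.
dissent_rate = sum(v == "harmful" for v in annotations.values()) / len(annotations)

print(aggregate, dissent_rate)
```

Under uniform weighting the single informed judgment does not change the aggregate label, which is the structural point of the epistemic-injustice argument: the training signal carries no trace of what the minority annotator saw.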

The alternative: AI should align with normative standards appropriate to its social roles (assistant, advisor, companion), negotiated by all relevant stakeholders. This is a contractualist framing — what people would reasonably agree to — rather than a utilitarian one. Preferences serve as proxies for values, informative of underlying structures, but not alignment targets in themselves.

This reframes the alignment tax identified in the note "Does preference optimization harm conversational understanding?". The tax exists because preference optimization targets a proxy that is systematically misaligned with the social role the system is meant to fill. A conversational assistant's normative standard should include grounding acts, such as clarification questions and acknowledgments that establish mutual understanding; RLHF's preference signal systematically selects against them.

The political infeasibility argument is particularly sharp: building AI that optimizes humanity's aggregate preferences would centralize immense power. Even pro-social developers face market incentives that prevent impartially benevolent optimization. The contractualist alternative distributes decision-making rather than centralizing it.

The "Personalisation within Bounds" paper extends this philosophical critique into practical governance. It identifies a "tyranny of the crowdworker" — RLHF alignment reflects whoever happened to label the data, with little documentation of who these labelers are or what perspectives they represent. The paper proposes a three-tiered policy framework: (1) supra-national bounds (safety, universal norms), (2) organizational bounds (institutional values, domain standards), and (3) individual personalization (user preferences within the bounded space). This provides a concrete implementation of the contractualist alternative — personalization is not unconstrained preference-matching but operates within negotiated societal and organizational limits.
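One way to read the three tiers is as nested constraint sets, with individual preferences honored only inside the space the outer tiers leave open. The sketch below assumes, purely for illustration, that each tier can be expressed as a set of permitted response styles. The tier contents, the `resolve_style` function, and the fallback-to-default rule are hypothetical and are not drawn from the paper.

```python
def resolve_style(user_pref, org_allowed, supranational_allowed, default):
    """Return the user's preferred style only if both outer tiers permit it.

    Otherwise fall back to a default that must itself lie within the bounds.
    """
    permitted = org_allowed & supranational_allowed
    if default not in permitted:
        raise ValueError("default must satisfy both outer tiers")
    return user_pref if user_pref in permitted else default


# Hypothetical tier definitions (assumptions, not from the paper).
SUPRANATIONAL = {"neutral", "formal", "casual", "blunt"}  # tier 1: outermost bounds
ORGANIZATION = {"neutral", "formal", "casual"}            # tier 2: narrower bounds

# Tier 3: individual preferences, applied only within the bounded space.
print(resolve_style("casual", ORGANIZATION, SUPRANATIONAL, "neutral"))
print(resolve_style("blunt", ORGANIZATION, SUPRANATIONAL, "neutral"))
```

The set intersection is the design choice that matters here: a user preference never widens the permitted space, it only selects within it.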

Extension: the measurement pincer. The Beyond Preferences critique operates at the normative level: preferences are the wrong kind of target for alignment. A complementary critique operates at the measurement level: even within the preferentist framework, what is being measured is often not a preference at all. The note "Are RLHF annotations actually measuring genuine human preferences?" argues from behavioral science that annotation responses frequently reflect non-attitudes, constructed preferences, and measurement artifacts rather than stable preferences. Taken together, the two critiques form a pincer: preferences are both wrong-in-kind (normative argument) and wrong-in-measurement (measurement argument). A reader who resists the normative argument because they find preferentism theoretically coherent still faces the measurement argument: if the inputs feeding the preferentist pipeline are invalid, no aggregation rule can recover what was never there. This strengthens the contractualist case by denying preferentism even its empirical foothold.

Enrichment: the operationalization-dependence argument. The HCLLM survey reaches the role-and-standards conclusion from a practical rather than a metaphysical direction, which is why it converges with this note. It argues that human-centered objectives "tend to resist universal solutions" because the optimal path depends both on who you ask and on how you operationalize contested concepts like harm and benefit. This is the applied face of the wrong-in-kind critique: if value is not a scalar preference but a thick, role-relative standard, then "align to preferences" underdetermines the target, because every operationalization encodes a contestable choice about whose standard is used and how it is measured. The survey's worries, that high-level guidelines lag real-world nuance and that passive stakeholders end up endorsing the status quo, describe exactly what happens when a wrong-kind target is treated as if it had a universal solution. Role-appropriate normative standards are the alternative both arguments point to.

Source: Human Centered Design, "Reflections and New Directions for Human-Centered Large Language Models", https://arxiv.org/abs/2605.06901

Original note title

AI should align with normative standards appropriate to social roles not with individual or aggregate preferences