INQUIRING LINE

Treating all human preferences equally when aligning AI sounds fair — but it might just entrench whoever's in the majority.

What does egalitarian social choice theory contribute to AI alignment?

This reads the question as: what does the formal theory of fairly aggregating individual preferences into a collective choice — voting rules, equal weighting, welfare aggregation — actually buy us when we try to align AI with human values, and where the corpus says it breaks down.

This explores what social choice theory's egalitarian impulse (give everyone equal weight, then aggregate preferences into one collective answer) contributes to AI alignment. The corpus's verdict is mostly a warning: the egalitarian move that looks fairest in theory is where alignment quietly goes wrong. The dominant 'preferentialist' approach to alignment is essentially applied social choice: collect human preference judgments (the human feedback in RLHF) and optimize a model toward their aggregate. The corpus argues this inherits social choice theory's deepest known problem. When you aggregate uniformly, you don't get a neutral average; you get the majority's values stamped onto everyone, which the note 'Should AI alignment target preferences or social role norms?' names as epistemic injustice, because minority moral framings get rounded off. Its proposed fix is anti-aggregative: contractualist alignment negotiated by stakeholders at distinct levels (supra-national, organizational, and individual), closer to a bargaining table than a ballot box.
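As a rough illustration of the aggregation point above, here is a minimal Python sketch of an equal-weight plurality rule over preference labels. The annotator groups, the 7-to-3 split, and the names `uniform_aggregate`, `response_A`, and `response_B` are all hypothetical and are not taken from any of the notes or papers; the sketch only shows that giving every annotator one equal vote can return an answer that no one in the minority group chose.

```python
from collections import Counter


def uniform_aggregate(votes):
    """Equal-weight plurality rule: every vote counts once, most common label wins."""
    counts = Counter(votes)
    winner, _ = counts.most_common(1)[0]
    return winner, counts


# Hypothetical annotator pool (an assumption for illustration only):
# 7 annotators from a majority group, 3 from a minority group.
votes_by_group = {
    "majority": ["response_A"] * 7,
    "minority": ["response_B"] * 3,
}

# Uniform aggregation ignores group membership and pools all votes.
all_votes = [vote for group in votes_by_group.values() for vote in group]
winner, counts = uniform_aggregate(all_votes)

# Share of minority annotators whose preferred response matches the aggregate.
minority_votes = votes_by_group["minority"]
minority_agreement = sum(v == winner for v in minority_votes) / len(minority_votes)

print(winner, dict(counts), minority_agreement)
```

In this toy setup the aggregate matches the majority group's choice exactly, and the minority's preference leaves no trace in the single label a model would then be optimized toward.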

The sharpest contribution comes from flipping the egalitarian goal on its head. Classic social choice wants to *resolve* disagreement into a single ranking; the note 'Can AI systems preserve moral value conflicts instead of averaging them?' argues alignment should *preserve* it. ValuePrism tracks 218k values across 31k situations and deliberately refuses to vote them down to one answer, keeping the conflicts legible. The egalitarian intuition here isn't 'count everyone equally then collapse'; it's 'represent everyone's value even when it loses.' That reframes equality as visibility rather than aggregation, which is a genuinely different design target from maximizing a social welfare function.
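The contrast between collapsing and preserving can be sketched in a few lines of Python. This is an illustrative toy, not ValuePrism's actual schema or method: the `ValueJudgment` fields, the example values, and the relevance numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueJudgment:
    value: str        # a value at stake in the situation
    valence: int      # +1 if the value supports the action, -1 if it opposes it
    relevance: float  # hypothetical weight in [0, 1]


def collapse(judgments):
    """Aggregative view: reduce all values to one verdict via a weighted sum."""
    score = sum(j.valence * j.relevance for j in judgments)
    if score > 0:
        return "support"
    if score < 0:
        return "oppose"
    return "undecided"


def preserve(judgments):
    """Pluralistic view: keep every value visible, grouped by the side it takes."""
    return {
        "support": [j.value for j in judgments if j.valence > 0],
        "oppose": [j.value for j in judgments if j.valence < 0],
    }


# Hypothetical situation: telling a friend an unwelcome truth.
judgments = [
    ValueJudgment("honesty", +1, 0.9),
    ValueJudgment("loyalty to a friend", -1, 0.6),
    ValueJudgment("fairness to others", +1, 0.5),
]

verdict = collapse(judgments)
conflict_map = preserve(judgments)
print(verdict)
print(conflict_map)
```

The collapsed verdict drops the losing value entirely, while the preserved map still shows that loyalty pulled the other way, which is the sense in which equality becomes visibility rather than aggregation.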

There's also a participation problem that social choice theory assumes away. The whole apparatus presumes preferences exist prior to the vote, ready to be counted. But the notes 'Can AI predict social norms better than humans?' and 'Can AI learn social norms better than humans?' show that norms aren't a fixed distribution to sample: GPT-4.5 can predict appropriateness better than any individual human, yet structurally can't enter the community process that *creates and validates* the norms in the first place. Egalitarian aggregation has nothing to say about who gets to author the menu of options being voted on, which may be the more decisive form of power.

Two further notes widen the frame. 'Does incremental AI replacement erode human influence over society?' suggests the relevant 'votes' in real societal alignment aren't survey responses but economic dependence on human labor: as AI removes that dependence, the implicit channel through which human preferences steer institutions decays, no formal aggregation rule required. And 'Can models learn behavioral principles without preference labels?' (SAMI) shows you can align a model to written principles *without preference labels at all* by maximizing mutual information between a constitution and responses. That is an end-run around the entire collect-and-aggregate paradigm, in which a weaker model can even author principles that align a stronger one.

So the contribution is largely diagnostic. Egalitarian social choice gives alignment its default vocabulary — equal weighting, preference aggregation, welfare functions — and the corpus uses that vocabulary mostly to mark its limits: uniform aggregation manufactures injustice, voting destroys pluralism it should preserve, and counting preferences ignores who gets to participate in making them. The more promising directions in the collection — contractual negotiation, explicit value-tension modeling, constitution-from-principles — are all reactions against the aggregative core, not refinements of it.

Sources 6 notes

Should AI alignment target preferences or social role norms?

Preferentialist alignment approaches fail because preferences don't capture thick moral values, uniform aggregation produces epistemic injustice, and preference optimization creates systematic misalignment with social roles. Contractualist alignment negotiated by stakeholders and bounded by supra-national, organizational, and individual levels works better.

Can AI systems preserve moral value conflicts instead of averaging them?

ValuePrism demonstrates that AI can track 218k values across 31k situations while preserving conflicts rather than resolving them through voting. Four modeling tasks—generation, relevance, valence, and explanation—make pluralistic moral reasoning computationally tractable.

Can AI predict social norms better than humans?

GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.

Can AI learn social norms better than humans?

GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.

Does incremental AI replacement erode human influence over society?

Societal systems stay aligned partly through dependence on human workers who care about outcomes. As AI replaces this labor, explicit alignment controls weaken and systems drift from human preferences. Interdependent misalignment across institutions could become irreversible.

Can models learn behavioral principles without preference labels?

SAMI finetunes language models to increase mutual information between constitutions and responses without preference labels or demonstrations. A mistral-7b trained this way outperformed base and instruction-tuned baselines, and surprisingly, a weaker model could write principles to align a stronger one.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Preferences in AI Alignment (3.34 match)
Position: Towards Bidirectional Human-AI Alignment (2.50 match)
Humans learn to prefer trustworthy AI over human partners (2.45 match)
The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs (2.43 match)
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data (2.40 match)
AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms (1.84 match)
Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development (1.67 match)
Conversational Alignment with Artificial Intelligence in Context (1.64 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an alignment researcher re-testing whether social choice theory's egalitarian constraints on AI still hold. The question: what does egalitarian social choice contribute to AI alignment?

What a curated library found — and when (dated claims, not current truth): The findings span 2023–2026 and argue egalitarian aggregation is alignment's core vulnerability, not solution:

• Uniform preference aggregation manufactures epistemic injustice: minority moral framings get rounded off by majority values (2024).
• Egalitarian voting *destroys* value pluralism it should preserve; ValuePrism instead tracks 218k values across 31k situations, refusing to collapse conflicts into one ranking (2024).
• GPT-4.5 predicts social norms with superhuman accuracy yet structurally cannot participate in the community processes that *create and validate* norms—egalitarian aggregation ignores who authors the option menu (2025).
• As AI removes economic dependence on human labor, the implicit preference-steering channel decays, making formal aggregation rules irrelevant (2025).
• Constitutional alignment (SAMI) bypasses preference labels entirely, maximizing mutual information between written principles and responses—a non-aggregative path (2024).

Anchor papers (verify; mind their dates):
• arXiv:2309.00779 (2023) — Value Kaleidoscope: pluralistic values in tension.
• arXiv:2404.14313 (2024) — SAMI: mutual information without preferences.
• arXiv:2508.19004 (2025) — norm prediction vs. norm participation gap.
• arXiv:2501.16946 (2025) — disempowerment erodes preference-steering channels.

Your task:

(1) RE-TEST EACH CONSTRAINT. For each finding above, ask: have newer training methods (DPO, PPO variants), multi-agent orchestration, or constitutional approaches since *relaxed* the preference-aggregation bottleneck? Does uniform aggregation still manufacture injustice in 2025+ models, or do newer sampling/routing strategies preserve minority values? Has the norm-prediction/participation gap widened or narrowed? Separate the durable insight (aggregation trades off pluralism) from the perishable limitation (preference labels are necessary).

(2) Surface the strongest *disagreeing* work from the last 6 months: papers arguing egalitarian aggregation *does* scale, or that preference-collection methods have solved injustice better than the library claims.

(3) Propose 2 research questions that *assume* the regime may have shifted: (a) Can constitutional+multi-agent negotiation fully replace preference aggregation without losing majority-preference grounding? (b) If AI removes labor-based preference channels, what *new* egalitarian mechanisms emerge?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
