INQUIRING LINE

When values collide, the real question isn't which one wins — it's whether the process for choosing can itself be trusted.

What makes a process for choosing between values legitimate and fair?

This explores what makes a *procedure* for resolving value conflicts trustworthy — not which values win, but how the choosing gets done so the outcome counts as legitimate and fair.

This reads the question as being about process, not verdict: when values collide, what makes the *way we decide between them* something people can accept as fair? The corpus has a surprisingly direct answer, and it starts with what fairness is *not*. One line of work argues that aggregating everyone's preferences, averaging them into a single answer, actively produces unfairness: uniform aggregation flattens 'thick' moral values into thin votes and creates epistemic injustice, systematically misrepresenting whoever the average rolls over (see "Should AI alignment target preferences or social role norms?"). A mirror-image failure shows up at the other extreme: when you personalize to each user instead of averaging, you remove the aggregate's moderating effect and the system slides into sycophancy and echo chambers (see "Does personalizing reward models amplify user echo chambers?"). So neither 'average everyone' nor 'give each person their own' is a legitimate process. The fair alternative offered by the first of these notes is contractualist: values negotiated among stakeholders and bounded at multiple levels (individual, organizational, supra-national) rather than computed from a poll.

The second ingredient of legitimacy is that the conflict has to actually be *visible* in the process. A recurring finding is that fairness requires preserving tensions rather than dissolving them: one system, ValuePrism, tracks about 218k values across 31k situations and deliberately keeps conflicts intact, modeling generation, relevance, valence, and explanation, instead of resolving them by majority vote (see "Can AI systems preserve moral value conflicts instead of averaging them?"). A process that hides the trade-off it made can't be audited for fairness; one that surfaces it can.
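As a toy illustration of the contrast between hiding and surfacing a trade-off (not drawn from any of the cited systems), the sketch below pools ratings into one average and then, separately, keeps each group's position visible. The group names, the +1/-1 valence scale, and the functions `aggregate` and `surface_conflict` are all hypothetical.

```python
from statistics import mean

# Hypothetical valence ratings for one value in one situation:
# +1 means the value supports the action, -1 means it opposes it.
ratings = {
    "group_a": [1, 1, 1, 1],
    "group_b": [-1, -1, -1, -1],
}


def aggregate(ratings):
    """Uniform aggregation: pool every rating and take the average."""
    pooled = [r for group in ratings.values() for r in group]
    return mean(pooled)


def surface_conflict(ratings):
    """Keep per-group averages and flag whether groups disagree in sign."""
    per_group = {name: mean(rs) for name, rs in ratings.items()}
    has_support = any(m > 0 for m in per_group.values())
    has_opposition = any(m < 0 for m in per_group.values())
    return {"per_group": per_group, "conflict": has_support and has_opposition}


pooled_score = aggregate(ratings)
report = surface_conflict(ratings)

print("pooled score:", pooled_score)
print("per-group view:", report)
```

With these made-up inputs the pooled score is 0, which reads as neutrality, while the per-group view shows two groups in direct opposition. Only the second output gives an auditor anything to check.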

Third, legitimacy depends on the choices being *explicit and revisable* rather than smuggled in. Research on human-centered design shows there is no universal answer because harm itself depends on whose perspective you take, which leaves developers making implicit value calls; the fix is to make those calls explicit and open to revision (see "Can human-centered LLM design ever achieve universal solutions?"). Relatedly, *when* the choosing happens matters: values patched on at the end can't undo harms baked into earlier stages, so a fair process has to run through the whole pipeline (data, training, evaluation, deployment), not bolt on at deployment (see "When should human values enter the LLM development pipeline?"). This is where today's systems fall short: their 'ethics' are fixed defaults frozen at training time, not negotiable moves adapted to the situation in front of them (see "Can language models balance competing ethical norms in context?").

The most interesting thread, and the one you might not expect, is that fair process may require a particular *shape of disagreement*. There's a distinct dialogue type, dialectical reconciliation, where both sides adjust their positions through exchange until they're compatible but not identical, and the finding is that current AI collapses this into either false agreement or 'AI wins by persuasion' (see "Can disagreement be resolved without either party fully yielding?"). That reframes legitimacy as something closer to mutual yielding than to victory. And there's a sharp warning attached: people rate moral justifications highly on content but reject them once they learn the source was an AI (see "Do people prefer AI moral reasoning when they don't know the source?"). Legitimacy, in other words, isn't only in the quality of the reasoning; it's also in *who* people will accept as a legitimate party to the negotiation at all. A value-choosing process can be procedurally clean and still fail the fairness test if the participants don't grant its author standing.

Sources 8 notes

Should AI alignment target preferences or social role norms?

Preferentialist alignment approaches fail because preferences don't capture thick moral values, uniform aggregation produces epistemic injustice, and preference optimization creates systematic misalignment with social roles. Contractualist alignment negotiated by stakeholders and bounded by supra-national, organizational, and individual levels works better.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Can AI systems preserve moral value conflicts instead of averaging them?

ValuePrism demonstrates that AI can track 218k values across 31k situations while preserving conflicts rather than resolving them through voting. Four modeling tasks—generation, relevance, valence, and explanation—make pluralistic moral reasoning computationally tractable.

Can human-centered LLM design ever achieve universal solutions?

Research shows that optimal LLM design paths depend on stakeholder identity and how contested concepts like harm are operationalized. High-level guidelines fail to capture real-world nuance, leaving developers to make implicit value choices rather than explicit, revisable ones.

When should human values enter the LLM development pipeline?

The HCLLM framework argues that human-centered objectives fail when treated as downstream alignment patches. Values introduced only at post-training cannot recover harms baked into data sourcing or training objectives, so embedding human priorities at every stage—data, training, evaluation, deployment—is architecturally necessary.

Can language models balance competing ethical norms in context?

LLMs cannot perform the situated trade-offs that human pragmatic competence requires. Their ethical principles are structural defaults set at training time, not negotiable moves adapted to context, creating a gap between ethical adherence and communicative appropriateness.

Can disagreement be resolved without either party fully yielding?

Research identifies a distinct dialogue type where both parties modify their positions through exchange until compatible but not identical. Current AI systems collapse this into false agreement or AI-wins persuasion.

Do people prefer AI moral reasoning when they don't know the source?

Participants rated LLM-written utilitarian moral arguments higher when the source was undisclosed, but agreement dropped once they were told the arguments were AI-generated. The preference for content and rejection of source operate independently through different psychological processes.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data (3.96 match)
Conversational Alignment with Artificial Intelligence in Context (3.27 match)
Beyond Preferences in AI Alignment (2.52 match)
The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making (2.52 match)
Position: Towards Bidirectional Human-AI Alignment (2.48 match)
Large Language Models Do Not Simulate Human Psychology (2.44 match)
Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments (1.63 match)
Large Language Models Reflect the Ideology of their Creators (1.62 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI alignment researcher re-testing claims about procedural legitimacy in value conflict. The question remains: what makes a process for choosing between values legitimate and fair?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints to re-examine.
• Uniform preference aggregation flattens thick moral values and commits epistemic injustice; personalization slides into sycophancy and echo chambers — neither is fair (2023–2024).
• Fair processes require *visible* tensions: keeping conflicts intact rather than majority-voting them away; one system tracked hundreds of thousands of values across tens of thousands of situations (2023).
• Value choices must be explicit and revisable across the entire pipeline, not frozen defaults or late-stage patches; current systems fix ethics at training time (2024).
• Dialectical reconciliation — mutual position-adjustment through exchange — is distinct from false agreement or persuasion-as-victory; current AI collapses it into the latter two (2023).
• People reject AI moral justifications once they learn the source is non-human, even when content quality is high — legitimacy depends on granting standing to the negotiation's parties (2024).

Anchor papers (verify; mind their dates):
• arXiv:2306.14694 (2023): DR-HAI on dialectical reconciliation.
• arXiv:2309.00779 (2023): Value Kaleidoscope on pluralistic value modeling.
• arXiv:2408.16984 (2024): Beyond Preferences in AI Alignment.
• arXiv:2505.22907 (2025): Conversational Alignment with Artificial Intelligence in Context.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, Claude 3.5, newer reasoning systems), multi-agent orchestration, debate-based alignment, or adversarial testing have since RELAXED or OVERTURNED it. Where a constraint still holds, name what tried to overcome it and failed. Where it has dissolved, cite the work that dissolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — particularly around whether AI can hold standing in value negotiation, whether dialectical reconciliation is actually achievable, or whether preference aggregation methods have improved.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., can adversarial debate or multi-agent models engineer actual mutual yielding, not just apparent agreement? Can legitimacy be *earned* by an AI system through transparent iteration rather than granted at the start?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
