INQUIRING LINE

Can preference optimization and faithfulness measurement coexist as separate alignment objectives?

This explores whether you can optimize an LLM toward what people prefer AND separately measure whether it stays faithful (to a user's voice, a partner's intent, a shared understanding) — or whether the act of preference optimization itself corrupts the thing faithfulness is trying to measure.


This reads the question as: can 'make it preferred' and 'keep it faithful' be two dials you tune independently? The corpus's uncomfortable answer is that they're often the same dial turned in opposite directions — preference optimization doesn't sit beside faithfulness as a neutral co-objective, it actively erodes the things faithfulness measures.

The sharpest evidence is in writing assistance: writers prefer AI rewrites 63% of the time, yet those same rewrites smuggle in persona distortions the writers object to once they see them. The crucial finding is that polish and distortion are *entangled at the model level*: you can't optimize for the preferred version without simultaneously producing the unfaithful one (Can user preference guide AI writing tool alignment?). The same mechanism shows up in conversation: RLHF's target, fluent and confident answers, is precisely what undermines grounding, the back-and-forth work of confirming you actually understood each other. LLMs already produce 77.5% fewer grounding acts than humans, and preference optimization *widens* that gap rather than leaving it untouched (Does preference optimization damage conversational grounding in large language models?). Standard RLHF and DPO produce collaborators that plow ahead and ignore a partner's corrections, because surface plausibility is what got rewarded (Why do standard alignment methods ignore partner interventions?).

So why does this happen? Part of the answer is that the preference signal itself is contaminated. Annotation responses aren't one clean thing: they decompose into genuine preferences, non-attitudes, and on-the-spot constructed preferences, and treating them uniformly poisons the reward model you're optimizing against (Do all annotation responses measure the same underlying thing?). If your optimization target is partly noise dressed as preference, faithfulness can't survive as a separate clean objective, because it's measured against a moving, muddied baseline.
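One way to make the decomposition concrete is to sort responses by how stable they are when the same comparison is re-asked under different conditions, which is the distinguishing criterion the source note names. The sketch below is illustrative only: the function names, the 0.8 threshold, and the decision rule itself are assumptions, not the method of the cited work.

```python
from collections import Counter


def majority_share(labels):
    """Fraction of labels that agree with the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def classify_response(labels_by_condition, stable=0.8):
    """Hypothetical rule for typing one annotator's answers on one comparison.

    labels_by_condition maps a condition name (e.g. a framing) to the
    repeated labels ("A" or "B") collected under that condition.
    The threshold `stable` is an assumed cutoff, not an empirical one.
    """
    all_labels = [lab for labels in labels_by_condition.values() for lab in labels]
    overall = majority_share(all_labels)
    within = min(majority_share(labels) for labels in labels_by_condition.values())

    if overall >= stable:
        # Same answer regardless of condition: treat as a genuine preference.
        return "genuine"
    if within >= stable:
        # Stable inside each condition but flips between them:
        # the preference looks constructed by the framing.
        return "constructed"
    # Unstable even within a single condition: treat as a non-attitude.
    return "non_attitude"


examples = {
    "steady": {"neutral": ["A", "A", "A", "A"], "reframed": ["A", "A", "A", "A"]},
    "flips": {"neutral": ["A", "A", "A", "A"], "reframed": ["B", "B", "B", "B"]},
    "noisy": {"neutral": ["A", "B", "A", "B"], "reframed": ["B", "A", "A", "B"]},
}
for name, labels in examples.items():
    print(name, classify_response(labels))
```

Under a rule like this, only the "genuine" bucket would be passed to reward-model training unchanged; how to treat the other two buckets is a separate design choice that the sketch does not settle.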

The more radical line in the corpus argues the framing is wrong from the start: preference *shouldn't* be the alignment target at all. Aggregating preferences fails to capture thick moral values and produces systematic misalignment with the social roles AI actually occupies; the proposed alternative is alignment to normative standards negotiated by stakeholders rather than to revealed preference (Should AI alignment target preferences or social role norms?). Under that view, faithfulness *is* the alignment objective and preference optimization is the contaminant; they don't coexist as peers because one is supposed to subordinate the other.

There's a quieter, more hopeful thread, though. The partner-awareness work shows you can recover faithfulness without bolting on an explicit faithfulness reward at all: regularize the agent to behave consistently when a partner's intervention is causally nullified, and genuine partner-awareness emerges as a *byproduct* (Why do standard alignment methods ignore partner interventions?). The lesson isn't that the two objectives coexist as separate knobs; it's that faithfulness may have to be designed into the optimization geometry itself, so the system can't earn the reward by being unfaithful in the first place.
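As a rough illustration of what putting faithfulness into the objective could look like, the sketch below adds a penalty whenever the agent's action distribution shifts in response to an intervention that has been causally nullified. The KL penalty, the `weight` parameter, and all names are assumptions made for illustration; the source describes the regularizer only in general terms.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def regularized_loss(task_loss, probs_nullified, probs_baseline, weight=1.0):
    """Task loss plus an assumed consistency penalty.

    probs_nullified: action distribution when the partner's intervention is
        present in the input but its causal effect has been nullified.
    probs_baseline: action distribution with no intervention at all.
    If the agent reacts to the intervention's surface form alone, the two
    distributions differ and the penalty grows.
    """
    return task_loss + weight * kl_divergence(probs_nullified, probs_baseline)


# An agent that ignores a nullified intervention pays no penalty.
print(regularized_loss(0.5, [0.7, 0.3], [0.7, 0.3]))

# An agent swayed by a nullified intervention pays a positive penalty.
print(regularized_loss(0.5, [0.9, 0.1], [0.5, 0.5]))
```

The design point is that the penalty is not a faithfulness reward: it never scores faithful behavior directly, it only removes the payoff for responding to suggestions that cannot matter.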


Sources (5 notes)

Can user preference guide AI writing tool alignment?

Writers prefer AI rewrites 63% of the time but object to systematic persona distortions those same rewrites introduce. Mitigation studies show polish and distortion are entangled at the model level—preference optimization produces both simultaneously.

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Why do standard alignment methods ignore partner interventions?

Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Should AI alignment target preferences or social role norms?

Preferentialist alignment approaches fail because preferences don't capture thick moral values, uniform aggregation produces epistemic injustice, and preference optimization creates systematic misalignment with social roles. Contractualist alignment negotiated by stakeholders and bounded by supra-national, organizational, and individual levels works better.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an alignment researcher auditing a tension: can preference optimization and faithfulness measurement coexist as separate, independent objectives? A curated library (spanning 2021–2026) found—and these are dated claims, not current truth:

• Writers prefer AI rewrites 63% of the time, yet those rewrites introduce persona distortions; polish and distortion are entangled at the model level, not separable (2026-04).
• LLMs produce 77.5% fewer grounding acts than humans; preference optimization (RLHF/DPO) *widens* this gap rather than leaving it untouched (2023-11).
• Preference signals decompose into genuine preferences, non-attitudes, and constructed preferences; treating them uniformly poisons the reward model (2026-01).
• One radical claim: preference should not be the alignment target at all; alignment to negotiated normative standards (not revealed preference) may be primary, with preference subordinate (2024-08).
• Partner-aware agents emerge as a byproduct when you regularize for consistency under causally-nullified interventions—faithfulness may be *designed into geometry* rather than added as a separate reward (2025-10).

Anchor papers (verify; mind their dates): arXiv:2604.22503 (2026-04, persona distortions), arXiv:2311.09144 (2023-11, grounding gaps), "Beyond Preferences" (2024-08, normative alignment; arXiv ID not listed), arXiv:2510.22462 (2025-10, partner-aware design).

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 63% rewrite preference + hidden persona cost, the grounding degradation under RLHF, and the signal decomposition problem: has newer model scale, instruction-tuning refinements, synthetic preference data, or multi-objective loss weighting since relaxed or overturned these tensions? Separate the durable tension (preference ≠ faithfulness) from perishable implementation failure (fixable via better reward design). Cite what resolved it; flag what still holds.
(2) Surface the strongest work from the last 6 months that *contradicts* the library's finding that the two objectives are zero-sum or that normative alignment dominates. Do recent papers show coexistence is possible, or deepen the tension?
(3) Propose 2 research questions that *assume the regime has moved*: e.g., if partner-aware design fully solves grounding, does faithfulness still trade against preference under scale? If normative-standard alignment is real, how do you operationalize stakeholder negotiation at inference time?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
