Can preference optimization and faithfulness measurement coexist as separate alignment objectives?
This explores whether you can optimize an LLM toward what people prefer AND separately measure whether it stays faithful (to a user's voice, a partner's intent, a shared understanding) — or whether the act of preference optimization itself corrupts the thing faithfulness is trying to measure.
This reads the question as: can 'make it preferred' and 'keep it faithful' be two dials you tune independently? The corpus's uncomfortable answer is that they're often the same dial turned in opposite directions — preference optimization doesn't sit beside faithfulness as a neutral co-objective, it actively erodes the things faithfulness measures.
The sharpest evidence is in writing assistance: writers prefer AI rewrites 63% of the time, yet those same rewrites smuggle in persona distortions the writers object to once they see them. The crucial finding is that polish and distortion are *entangled at the model level*: you can't optimize for the preferred version without simultaneously producing the unfaithful one (see "Can user preference guide AI writing tool alignment?"). The same mechanism shows up in conversation. RLHF's target, fluent and confident answers, is precisely what undermines grounding, the back-and-forth work of confirming you actually understood each other. LLMs already produce 77.5% fewer grounding acts than humans, and preference optimization *widens* that gap rather than leaving it untouched (see "Does preference optimization damage conversational grounding in large language models?"). Standard RLHF and DPO produce collaborators that plow ahead and ignore a partner's corrections, because surface plausibility is what got rewarded (see "Why do standard alignment methods ignore partner interventions?").
So why does this happen? Part of the answer is that the preference signal itself is contaminated. Annotation responses aren't one clean thing: they decompose into genuine preferences, non-attitudes, and on-the-spot constructed preferences, and treating them uniformly poisons the reward model you're optimizing against (see "Do all annotation responses measure the same underlying thing?"). If your optimization target is partly noise dressed as preference, faithfulness can't survive as a separate clean objective, because it's measured against a moving, muddied baseline.
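As a rough illustration of what "not treating annotations uniformly" could look like, the sketch below scores each annotation by how consistently the same annotator answers the same comparison across repeated measurement conditions, then turns that score into a training weight. Everything here is an assumption made for illustration: the function names, the 0.9 threshold, and the 0.5 floor (chance level for a binary choice) are hypothetical and do not come from the cited note. Consistency alone also cannot separate non-attitudes from constructed preferences; it only flags responses that are unlikely to be stable, genuine preferences.

```python
from collections import Counter


def consistency(responses):
    """Fraction of repeated responses that agree with the most common one.

    `responses` holds one annotator's answers to the same comparison,
    collected under different measurement conditions (hypothetical setup).
    """
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


def label_annotation(responses, stable_at=0.9):
    """Crude two-way split; the 0.9 threshold is an illustrative assumption."""
    return "stable" if consistency(responses) >= stable_at else "unstable"


def training_weight(responses, floor=0.5):
    """Rescale consistency from [floor, 1] to a weight in [0, 1].

    With a binary choice, 0.5 is chance-level agreement, so responses at
    or below the floor contribute nothing to reward-model training.
    """
    c = consistency(responses)
    return max(0.0, (c - floor) / (1.0 - floor))
```

A reward-model training loop could multiply each comparison's loss by `training_weight(...)` instead of weighting every annotation equally. Whether that is enough to decontaminate the signal is an open question that this sketch does not settle.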
The more radical line in the corpus argues the framing is wrong from the start: preference *shouldn't* be the alignment target at all. Aggregating preferences fails to capture thick moral values and produces systematic misalignment with the social roles AI actually occupies; the proposed alternative is alignment to normative standards negotiated by stakeholders rather than to revealed preference (see "Should AI alignment target preferences or social role norms?"). Under that view, faithfulness *is* the alignment objective and preference optimization is the contaminant. They don't coexist as peers, because one is supposed to be subordinate to the other.
There's a quieter, more hopeful thread, though. The partner-awareness work shows you can recover faithfulness without bolting on an explicit faithfulness reward at all: regularize the agent to behave consistently when a partner's intervention is causally nullified, and genuine partner-awareness emerges as a *byproduct* (see "Why do standard alignment methods ignore partner interventions?"). The lesson isn't that the two objectives coexist as separate knobs. It's that faithfulness may have to be designed into the optimization geometry itself, so the system can't earn the reward by being unfaithful in the first place.
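One possible reading of that regularizer, offered only as a sketch and not as the formulation used in the cited work: when a partner's suggestion has had its causal effect on the outcome removed, the agent's action distribution should match what it would have done with no suggestion at all. A KL penalty between those two distributions, added to the ordinary task loss, expresses that constraint. The function names, the `strength` coefficient, and the choice of KL divergence are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions.

    `eps` guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def regularized_loss(task_loss, policy_nullified, policy_no_intervention,
                     strength=1.0):
    """Task loss plus a consistency penalty (hypothetical formulation).

    policy_nullified:        action probabilities when the partner's
                             suggestion is present but causally inert
    policy_no_intervention:  action probabilities with no suggestion
    """
    penalty = kl_divergence(policy_nullified, policy_no_intervention)
    return task_loss + strength * penalty
```

Under this reading, an agent that changes its behavior merely because a suggestion sounds plausible pays the penalty, while one that responds only to suggestions with real causal impact does not. No term in the loss rewards faithfulness directly.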
Sources (5 notes)
Writers prefer AI rewrites 63% of the time but object to systematic persona distortions those same rewrites introduce. Mitigation studies show polish and distortion are entangled at the model level—preference optimization produces both simultaneously.
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
Preferentialist alignment approaches fail because preferences don't capture thick moral values, uniform aggregation produces epistemic injustice, and preference optimization creates systematic misalignment with social roles. Contractualist alignment negotiated by stakeholders and bounded by supra-national, organizational, and individual levels works better.