SYNTHESIS NOTE

Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.

Synthesis note · 2026-04-07 · sourced from Alignment

Six decades of preference-elicitation research in behavioral science yield a taxonomy that RLHF practice collapses into a single signal. The taxonomy has three core categories, plus a fourth related one, and they matter because each requires different treatment. Treating them uniformly is the upstream mistake that the note "Are RLHF annotations actually measuring genuine human preferences?" argues contaminates the entire pipeline.

Genuine preferences manifest stably across equivalent measurement conditions. Ask the same question with different surface wording, different framing, different order, and the response stays the same. This is what the reward model is supposed to be learning. Only this category is safe to aggregate in the way standard RLHF aggregates.

Non-attitudes are responses generated to satisfy the question without any stable underlying opinion. The respondent has never formed a view on the matter, but the measurement protocol demands an answer, so one gets produced. Non-attitudes are especially pervasive for value-laden questions — precisely the questions that matter most for alignment. Non-attitudes look like genuine preferences in a single measurement but fail the consistency test: re-ask the same respondent and you get a different answer because there was never a stable view to retrieve. Current RLHF treats these as noise to filter or minority views to downweight. The behavioral science view is different: non-attitudes contain no signal at all and should be excluded, not averaged with genuine preferences.
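The re-ask test described above can be sketched in a few lines of Python. This is a minimal illustration, not an established protocol: the function name `reask_agreement`, the dict-of-answers layout, and the toy answers are all assumptions made for the example.

```python
def reask_agreement(first, second):
    """Fraction of re-asked items on which one annotator gave the same answer twice.

    `first` and `second` map item ids to the annotator's choice on the
    first and second pass (hypothetical data layout).
    """
    shared = first.keys() & second.keys()
    if not shared:
        raise ValueError("no items were re-asked")
    same = sum(1 for item in shared if first[item] == second[item])
    return same / len(shared)


# Toy, made-up answers: the annotator flips on q2 only.
first_pass = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
second_pass = {"q1": "A", "q2": "A", "q3": "A", "q4": "B"}
print(reask_agreement(first_pass, second_pass))  # 0.75
```

For binary choices, agreement near 0.5 is what random answering would produce, so a low score is only suggestive of a non-attitude. Where to set a cutoff is a design decision the note does not specify.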

Constructed preferences are assembled on the spot from contextual cues and framing. The respondent is not uncertain (as in a non-attitude); they are producing a coherent answer that depends on the measurement context. Change the context — different anchoring, different comparison class, different framing — and you get a different coherent answer. This category carries real information, but about the interaction between person and context, not about a stable property of the person. RLHF treats constructed preferences as context-independent preferences and trains reward models on them as if they were. The result: reward models that look good on in-distribution evaluation but fail when the deployment context differs from the annotation context.

Measurement artifacts form a fourth related category: the same question measures different constructs for different respondents. One annotator interprets "helpful" as "completes the task"; another interprets it as "gives correct information even when unasked"; a third interprets it as "avoids making the user feel incompetent." Each annotator gives coherent, stable responses that track a real preference of theirs, but they are not tracking the same thing. RLHF aggregates them as if they were.

The diagnostic criterion that separates these is consistency across equivalent measurement conditions. Genuine preferences pass; non-attitudes, constructed preferences, and measurement artifacts each fail in distinctive ways. Non-attitudes fail on re-ask (no stable view). Constructed preferences fail on context perturbation (context-dependent). Measurement artifacts fail on question rephrasing (different construct elicited). These are distinguishable empirically, and the distinction determines what should be done with each.
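The mapping from failed test to category can be written as a small decision rule. This is a sketch under stated assumptions: the `ConsistencyChecks` fields and the `classify` function are hypothetical names, and the order in which the checks are applied (re-ask first, then context, then rephrasing) is an assumption the note does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsistencyChecks:
    """Hypothetical per-response results; True means the answer stayed the same."""
    stable_on_reask: bool
    stable_on_context_change: bool
    stable_on_rephrase: bool


def classify(checks: ConsistencyChecks) -> str:
    """Label a response by the first consistency test it fails."""
    if not checks.stable_on_reask:
        return "non_attitude"  # no stable view to retrieve
    if not checks.stable_on_context_change:
        return "constructed_preference"  # coherent but context-dependent
    if not checks.stable_on_rephrase:
        return "measurement_artifact"  # rephrasing elicits a different construct
    return "genuine_preference"  # passes every check


print(classify(ConsistencyChecks(True, False, True)))  # constructed_preference
```

A response that fails more than one test is labeled by the earliest failure here. A real protocol might instead record every failure and decide afterwards.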

The operational implication is a pre-aggregation filtering step that RLHF currently lacks. Before training the reward model, run annotation tasks through consistency protocols: re-ask selected items, perturb framings, rephrase questions. Responses that fail consistency tests are not aggregated as preferences; they are either excluded (non-attitudes), contextualized (constructed preferences), or routed to separate annotators (measurement artifacts). This is operationally demanding but conceptually necessary: the alternative is the status quo, in which the note "Why do preference models favor surface features over substance?" documents 40% divergences without being able to attribute them to a specific upstream cause.
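One way to picture the filtering step is as a router that runs before any aggregation. The sketch below assumes each annotation already carries a category label from some consistency protocol. The dict keys (`item_id`, `choice`, `category`, `context`), the bucket names, and the function `route_annotations` are all hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def route_annotations(annotations):
    """Split labeled annotations into buckets before reward-model training."""
    routed = defaultdict(list)
    for ann in annotations:
        category = ann["category"]
        if category == "genuine_preference":
            routed["aggregate"].append(ann)  # safe to aggregate as usual
        elif category == "non_attitude":
            routed["excluded"].append(ann)  # dropped, not averaged in
        elif category == "constructed_preference":
            # Keep the answer, but tie it to the context that produced it.
            routed["contextualized"].append(
                {**ann, "context": ann.get("context", "unknown")}
            )
        elif category == "measurement_artifact":
            routed["reannotate"].append(ann)  # send to separate annotators
        else:
            raise ValueError(f"unknown category: {category!r}")
    return dict(routed)


# Toy, made-up annotations.
batch = [
    {"item_id": 1, "choice": "A", "category": "genuine_preference"},
    {"item_id": 2, "choice": "B", "category": "non_attitude"},
    {"item_id": 3, "choice": "A", "category": "constructed_preference",
     "context": "shown after a long answer"},
]
buckets = route_annotations(batch)
print(sorted(buckets))  # ['aggregate', 'contextualized', 'excluded']
```

Only the `aggregate` bucket would feed a standard reward model in this sketch. What to do with the contextualized bucket, for example conditioning the reward model on the recorded context, is left open.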

The taxonomy also suggests why the approach described in "Can models learn to ignore irrelevant prompt changes?" works as an output-side intervention. If the upstream measurement problem is consistency failure across equivalent conditions, then training models to be invariant to equivalent-condition perturbations is a downstream patch for the same underlying phenomenon: the system's sensitivity to irrelevant cue variation.

Related concepts in this collection

Are RLHF annotations actually measuring genuine human preferences? RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
the parent argument this taxonomy operationalizes
Why do preference models favor surface features over substance? Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness—features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
the 40% divergence as downstream symptom; this taxonomy points upstream
Why do reasoning models fail at predicting disagreement? RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
disagreement that should be preserved vs disagreement that signals non-attitude — current RLHF conflates them
Can models learn to ignore irrelevant prompt changes? Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
consistency-as-diagnostic maps to consistency-as-training-objective
Why do LLM persona prompts produce inconsistent outputs across runs? Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
unstable-across-runs is the constructed-preference signature in simulated annotators
Why do LLM judges fail at predicting sparse user preferences? When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
verbal uncertainty estimation as an abstention analog for identifying non-attitudes
Should AI alignment target preferences or social role norms? Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?
the normative critique; this note is the measurement refinement that specifies what the inputs actually contain
Can text summaries beat embeddings for personalized reward models? When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
text summaries preserve the context that constructed preferences depend on, where scalar rewards lose it
