SYNTHESIS NOTE

Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.

Synthesis note · 2026-04-07 · sourced from Alignment

Six decades of preference elicitation research in behavioral science yield a taxonomy that RLHF practice collapses into a single signal. The three categories matter because they require different treatment, and treating them uniformly is the upstream mistake that the note "Are RLHF annotations actually measuring genuine human preferences?" argues contaminates the entire pipeline.

Genuine preferences manifest stably across equivalent measurement conditions. Ask the same question with different surface wording, different framing, different order, and the response stays the same. This is what the reward model is supposed to be learning. Only this category is safe to aggregate in the way standard RLHF aggregates.

Non-attitudes are responses generated to satisfy the question without any stable underlying opinion. The respondent has never formed a view on the matter, but the measurement protocol demands an answer, so one gets produced. Non-attitudes are especially pervasive for value-laden questions — precisely the questions that matter most for alignment. Non-attitudes look like genuine preferences in a single measurement but fail the consistency test: re-ask the same respondent and you get a different answer because there was never a stable view to retrieve. Current RLHF treats these as noise to filter or minority views to downweight. The behavioral science view is different: non-attitudes contain no signal at all and should be excluded, not averaged with genuine preferences.
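The re-ask test described above can be sketched in a few lines. This is a minimal illustration, assuming each annotator answered the same pairwise comparison twice; the function name `retest_agreement` and the record layout are hypothetical and do not come from any RLHF toolkit.

```python
from collections import defaultdict


def retest_agreement(records):
    """Per-annotator fraction of re-asked items answered the same way twice.

    `records` is an iterable of
    (annotator_id, item_id, first_answer, second_answer) tuples.
    This layout is an assumption made for illustration.
    """
    same = defaultdict(int)
    total = defaultdict(int)
    for annotator, _item, first, second in records:
        total[annotator] += 1
        if first == second:
            same[annotator] += 1
    return {annotator: same[annotator] / total[annotator] for annotator in total}


# Toy data: ann_1 repeats both answers, ann_2 flips on one of two items.
records = [
    ("ann_1", "q1", "A", "A"),
    ("ann_1", "q2", "B", "B"),
    ("ann_2", "q1", "A", "B"),
    ("ann_2", "q2", "B", "B"),
]
rates = retest_agreement(records)
```

For a binary choice answered at random, expected agreement is 0.5, so a rate near that level is what the absence of a stable view would look like. A real protocol would need many more re-asked items per annotator than this toy example before drawing that conclusion.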

Constructed preferences are assembled on the spot from contextual cues and framing. The respondent is not uncertain (as in a non-attitude); they are producing a coherent answer that depends on the measurement context. Change the context — different anchoring, different comparison class, different framing — and you get a different coherent answer. This category carries real information, but about the interaction between person and context, not about a stable property of the person. RLHF treats constructed preferences as context-independent preferences and trains reward models on them as if they were. The result: reward models that look good on in-distribution evaluation but fail when the deployment context differs from the annotation context.

Measurement artifacts form a fourth, related category: the same question measures different constructs for different respondents. One annotator interprets "helpful" as "completes the task"; another interprets it as "gives correct information even when unasked"; a third interprets it as "avoids making the user feel incompetent." Each annotator provides coherent, stable responses that track a real preference of theirs, but the annotators are not tracking the same thing. RLHF aggregates their responses as if they were.

The diagnostic criterion that separates these is consistency across equivalent measurement conditions. Genuine preferences pass; non-attitudes, constructed preferences, and measurement artifacts each fail in distinctive ways. Non-attitudes fail on re-ask (no stable view). Constructed preferences fail on context perturbation (context-dependent). Measurement artifacts fail on question rephrasing (different construct elicited). These are distinguishable empirically, and the distinction determines what should be done with each.
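The mapping from failure pattern to category can be written as a small decision function. This is a sketch under stated assumptions: the three boolean probe results are taken as given, the names `SignalType` and `classify_response` are hypothetical, and the order in which failures are checked is a choice made here for illustration, not something the taxonomy prescribes.

```python
from enum import Enum


class SignalType(Enum):
    GENUINE = "genuine preference"
    NON_ATTITUDE = "non-attitude"
    CONSTRUCTED = "constructed preference"
    ARTIFACT = "measurement artifact"


def classify_response(stable_on_reask: bool,
                      stable_on_context: bool,
                      stable_on_rephrase: bool) -> SignalType:
    """Map the outcome of three consistency probes to a signal type."""
    # Assumed ordering: a failed re-ask means there is no stable view,
    # so the other two probes are not interpreted.
    if not stable_on_reask:
        return SignalType.NON_ATTITUDE
    # Assumed tie-break: if both remaining probes fail, the rephrasing
    # failure is reported first. This choice is arbitrary.
    if not stable_on_rephrase:
        return SignalType.ARTIFACT
    if not stable_on_context:
        return SignalType.CONSTRUCTED
    return SignalType.GENUINE


label = classify_response(stable_on_reask=True,
                          stable_on_context=False,
                          stable_on_rephrase=True)
```

In this example `label` is `SignalType.CONSTRUCTED`: the response survives re-asking and rephrasing but shifts when the context is perturbed.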

The operational implication is a pre-aggregation filtering step that RLHF currently lacks. Before training the reward model, submit annotation tasks to consistency protocols: re-ask selected items, perturb framings, rephrase questions. Responses that fail consistency tests are not aggregated as preferences; they are excluded (non-attitudes), contextualized (constructed preferences), or routed to separate annotators (measurement artifacts). This is operationally demanding but conceptually necessary: the alternative is the status quo, in which the note "Why do preference models favor surface features over substance?" documents 40% divergences without being able to attribute them to a specific upstream cause.
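The routing step might look like the following. It is a minimal sketch, assuming each annotation already carries a `signal_type` label produced by some upstream consistency protocol; the function `route_annotations`, the label strings, and the pool names are all hypothetical.

```python
def route_annotations(annotations):
    """Partition labeled annotations into pools before reward-model training.

    Each annotation is a dict with at least a "signal_type" key.
    The label strings and pool names are assumptions for illustration.
    """
    destination = {
        "genuine": "aggregate",          # safe to aggregate as preferences
        "non_attitude": "excluded",      # dropped, not averaged in
        "constructed": "contextualized", # kept together with its context
        "artifact": "rerouted",          # sent to separate annotators
    }
    pools = {name: [] for name in destination.values()}
    for ann in annotations:
        kind = ann["signal_type"]
        if kind not in destination:
            raise ValueError(f"unknown signal type: {kind!r}")
        pools[destination[kind]].append(ann)
    return pools


batch = [
    {"id": 1, "signal_type": "genuine", "choice": "A"},
    {"id": 2, "signal_type": "non_attitude", "choice": "B"},
    {"id": 3, "signal_type": "constructed", "choice": "A", "context": "anchor_high"},
    {"id": 4, "signal_type": "artifact", "choice": "B"},
]
pools = route_annotations(batch)
```

Only `pools["aggregate"]` would feed standard aggregation. The constructed-preference pool keeps the `context` field attached so that any later use can condition on it rather than treating the choice as context-independent.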

The taxonomy also suggests why the approach in "Can models learn to ignore irrelevant prompt changes?" works as an output-side intervention. If the upstream measurement problem is consistency failure across equivalent conditions, then training models to be invariant to equivalent-condition perturbations is a downstream patch for the same underlying phenomenon: the system's current lack of robustness against irrelevant cue variation.
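One way to make "invariant to equivalent-condition perturbations" concrete is to measure how much a scorer's output varies across prompts that are meant to be equivalent. The sketch below uses the population variance of the scores as that measure. This is an illustrative stand-in chosen here, not the method of the note referenced above, and `invariance_penalty` and both toy scorers are hypothetical.

```python
from statistics import pvariance


def invariance_penalty(score_fn, variants):
    """Population variance of a scorer's outputs across equivalent variants.

    Zero means the scorer gives every variant the same score.
    `variants` must contain at least one item.
    """
    scores = [float(score_fn(v)) for v in variants]
    return pvariance(scores)


def length_biased_scorer(text):
    # Toy scorer that reacts to a surface feature (length).
    return len(text)


def constant_scorer(text):
    # Toy scorer that ignores surface wording entirely.
    return 1.0


variants = ["Summarize this.", "Please could you summarize this text for me?"]
biased_penalty = invariance_penalty(length_biased_scorer, variants)
flat_penalty = invariance_penalty(constant_scorer, variants)
```

Here `flat_penalty` is zero and `biased_penalty` is positive. A quantity of this kind could serve as an evaluation metric or, in a differentiable form, as a training penalty; which of the two is appropriate is left open here.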

Original note title

annotation responses decompose into three distinct signal types — genuine preferences non-attitudes and constructed preferences — each requiring fundamentally different handling