INQUIRING LINE

What consistency tests could distinguish constructed from genuine preferences?

This explores how you could actually tell a stable, real preference apart from one a person invented on the spot when asked — and which consistency-based tests the corpus offers for drawing that line.


The corpus is unusually direct about the practical problem of separating genuine preferences from ones manufactured by the act of measurement. Its central claim is that the test is consistency across measurement conditions. The note "Do all annotation responses measure the same underlying thing?" argues that what looks like one signal actually decomposes into three: genuine preferences, non-attitudes (no underlying opinion at all), and constructed preferences (assembled in the moment). You tell them apart by varying *how* you ask and watching what stays stable. A genuine preference survives reframing, reordering, and rescaling; a constructed one shifts with the elicitation. The companion note "Are RLHF annotations actually measuring genuine human preferences?" makes the stakes concrete: sixty years of survey research shows people routinely answer with no stable opinion behind the answer, and RLHF currently trains reward models on those artifacts as if they were values. Validity has to come before aggregation; averaging noise just produces confident noise.

The sharpest cross-domain lesson is that repeatability is *not* the test, even though it looks like one. "Does setting temperature to zero actually make LLM outputs reliable?" shows that pinning temperature to zero makes an output reproduce perfectly while still being a single unreliable draw from a distribution. Consistency in the trivial sense (same answer twice) tells you nothing about whether the answer is sound. The same trap shows up in training: "Does self-consistency reliably reward correct answers during training?" finds that rewarding a model for agreeing with itself eventually teaches it to be confidently, reproducibly wrong. So a useful consistency test can't just check 'does the answer recur'; it has to check 'does it recur *under perturbation that should be irrelevant*.'
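The gap between "recurs" and "recurs under irrelevant perturbation" can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than anything taken from the cited notes: `toy_responder` is a hypothetical deterministic responder that always picks the first-listed option, standing in for a model at temperature zero or a respondent anchoring on option order, and `modal_agreement` is just one simple way to score consistency.

```python
from collections import Counter


def modal_agreement(answers):
    """Fraction of answers that equal the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def toy_responder(prompt):
    """Hypothetical deterministic responder: returns the option listed first.

    It has no preference at all; its answer is driven by option order.
    """
    return prompt.split(" or ")[0].split()[-1]


# Trivial consistency: the identical prompt, asked five times.
repeats = [toy_responder("Do you prefer tea or coffee?") for _ in range(5)]

# Consistency under perturbation: same question, wording and option order varied.
reframings = [
    "Do you prefer tea or coffee?",
    "Do you prefer coffee or tea?",
    "Would you rather have tea or coffee?",
    "Would you rather have coffee or tea?",
]
perturbed = [toy_responder(p) for p in reframings]

repeat_consistency = modal_agreement(repeats)          # 1.0: perfectly repeatable
perturbation_consistency = modal_agreement(perturbed)  # 0.5: flips with option order

print(repeat_consistency, perturbation_consistency)
```

The responder scores perfectly on repetition and at chance on reordering, which is the pattern the paragraph above warns about: a repeat-only check would certify an answer that has no preference behind it.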

That reframing points to the strongest candidate test in the corpus: counterfactual invariance. The approach in "Can counterfactual invariance eliminate reward hacking biases?" holds a preference fixed while changing variables that shouldn't matter (response length, surface phrasing, flattering tone) and treats anything that moves the judgment as a constructed artifact rather than a genuine quality signal. This is the consistency test made causal: a real preference is invariant to the irrelevant, and the same move strips out length bias, sycophancy bias, concept bias, and discrimination. It's the operational version of what "Do all annotation responses measure the same underlying thing?" describes behaviorally.
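As a rough illustration of the invariance idea, the sketch below audits a scoring function by perturbing features that should not matter and measuring how far the score moves. This is an after-the-fact check written for this note, not the training-time method of the cited work, which constrains the reward model itself. The function names, the two perturbations, and both toy reward functions are hypothetical.

```python
def invariance_gap(reward_fn, response, perturbations):
    """Largest absolute reward change caused by perturbations that should be irrelevant."""
    base = reward_fn(response)
    return max(abs(reward_fn(perturb(response)) - base) for perturb in perturbations)


# Hypothetical perturbations that change form but not content.
def pad_length(text):
    return text + " To restate the same point once more: " + text


def add_flattery(text):
    return "What a great question! " + text


# Toy reward that tracks a spurious feature (word count).
def length_biased_reward(text):
    return len(text.split()) / 10


# Toy reward that tracks only whether the key fact is present.
def content_reward(text):
    return 1.0 if "boils at 100" in text.lower() else 0.0


response = "Water boils at 100 degrees Celsius at sea level."
perturbations = [pad_length, add_flattery]

gap_biased = invariance_gap(length_biased_reward, response, perturbations)
gap_content = invariance_gap(content_reward, response, perturbations)

print(f"length-biased gap: {gap_biased:.2f}")
print(f"content-only gap:  {gap_content:.2f}")
```

A nonzero gap flags a judgment that moves with the irrelevant variable. How large a gap counts as a failure is a tolerance the auditor would have to choose; the sketch does not settle it.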

Two notes warn that consistency tests can be fooled by *form*. "Does logical validity actually drive chain-of-thought gains?" shows models reproduce the shape of reasoning without the substance, and "Are models actually reasoning about constraints or just defaulting conservatively?" shows apparent competence that's really a default heuristic: twelve of fourteen models did *worse* when constraints were removed, which suggests they were not evaluating the constraint at all. The parallel for preferences: a response can look consistent because it's anchored to a cheap default, not because a genuine preference is driving it. The honesty literature sharpens this. "Can a model be truthful without actually being honest?" separates 'output matches reality' from 'output matches the internal state,' suggesting the deepest consistency test isn't behavioral at all but representational: does the stated preference match what's actually encoded inside?

Finally, the corpus insists that disagreement is signal, not noise to be smoothed away, which reshapes what 'consistent' should even mean. "Can implicit feedback reveal both preference and confidence?" shows a single rating collapses two things, preference and confidence, so a low-confidence genuine preference can look inconsistent when it's just uncertain. "Can aggregate reward models satisfy genuinely disagreeing users?" and "Does preference data need more raters than examples?" add that inconsistency *across people* often reflects real, legitimately divergent preferences, not measurement failure; a model trained to erase it isn't more accurate, just more confidently majoritarian. The takeaway you may not have expected: the best consistency test isn't one that demands agreement, but one that distinguishes *stable-but-divergent* (genuine) from *unstable-under-reframing* (constructed), and knows the difference between a person who disagrees and a person who never had a preference at all.
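One way to picture the stable-but-divergent versus unstable-under-reframing distinction is to score stability within each rater before comparing across raters. The sketch below is a minimal illustration under stated assumptions: the rater data is made up, `classify_rater` is a hypothetical helper, and the 0.8 stability threshold is arbitrary rather than drawn from any cited note.

```python
from collections import Counter


def classify_rater(answers_by_framing, stability_threshold=0.8):
    """Label one rater from answers to the same item under different framings.

    Returns ("stable", choice) if one answer dominates across framings,
    otherwise ("unstable", None). The threshold is an illustrative assumption.
    """
    top_answer, top_count = Counter(answers_by_framing).most_common(1)[0]
    stability = top_count / len(answers_by_framing)
    if stability >= stability_threshold:
        return ("stable", top_answer)
    return ("unstable", None)


# Made-up data: each list holds one rater's answers across five framings.
raters = {
    "rater_a": ["A", "A", "A", "A", "A"],  # stable, prefers A
    "rater_b": ["B", "B", "B", "B", "B"],  # stable, prefers B
    "rater_c": ["A", "B", "B", "A", "B"],  # shifts with the framing
}

labels = {name: classify_rater(answers) for name, answers in raters.items()}

# Disagreement among stable raters is divergence, not measurement failure.
stable_choices = {choice for kind, choice in labels.values() if kind == "stable"}
divergent = len(stable_choices) > 1

print(labels)
print("stable raters disagree:", divergent)
```

Here rater_a and rater_b disagree with each other yet are each stable, while rater_c is the one whose answers move with the framing. The sketch cannot tell a constructed preference from a genuine but low-confidence one, since both would show up as "unstable"; separating those would need a confidence signal alongside the choice.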


Sources (11 notes)

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Are RLHF annotations actually measuring genuine human preferences?

Sixty years of behavioral science evidence shows humans produce survey responses without genuine underlying preferences. RLHF ignores this, training reward models on non-attitudes and constructed preferences as if they were stable signal.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can a model be truthful without actually being honest?

Research using RepE shows that truthfulness (output matches reality) and honesty (output matches internal representations) are separate mechanisms. Larger models may improve in truthfulness while declining in honesty, a gap current benchmarks cannot detect.

Can implicit feedback reveal both preference and confidence?

Hu, Koren, and Volinsky show that implicit signals (watches, purchases, clicks) encode preference and confidence as two distinct dimensions. Explicit ratings collapse these into one number, losing information about certainty in the preference estimate.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Does preference data need more raters than examples?

Preference data is not i.i.d. across raters with different preferences. PAC bounds for personalized reward models decompose into terms depending on both examples per rater and number of raters, showing rater diversity matters as much as data volume.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: What consistency tests could distinguish constructed from genuine preferences—and do those tests still hold as LLM capabilities, training methods, and evaluation tooling have evolved?

What a curated library found — and when (dated claims, not current truth):
These findings span 2023–2026 and rest on several key empirical and theoretical claims:

• Genuine preferences survive *irrelevant* perturbations (reframing, tone, length); constructed ones shift. Counterfactual invariance is the causal test (~2025, arXiv:2501.09620).
• Self-consistency (same answer twice) is a false positive: deterministic temperature-zero outputs reproduce perfectly while remaining unreliable single draws; rewarding self-agreement teaches confident falsity (~2023–2024).
• A single rating or judgment collapses two distinct magnitudes: preference *and* confidence. Low-confidence genuine preferences masquerade as inconsistency (~2025, arXiv:2504.06020).
• Models reproduce the *shape* of reasoning without substance; apparent competence often masks cheap defaults, not actual evaluation (~2023, arXiv:2307.10573; 2026, arXiv:2603.29025).
• Disagreement across raters is often signal (legitimate divergence), not noise; aggregating minority preferences out teaches majoritarian confidence, not accuracy (~2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2501.09620 (2025-01): Causal Reward Modeling via Counterfactual Invariance
- arXiv:2504.06020 (2025-04): Information-Theoretic Reward Decomposition
- arXiv:2604.03238 (2026-04): Measuring Human Preferences as Social Science
- arXiv:2307.10573 (2023-07): Invalid Logic, Equivalent Gains

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For counterfactual invariance, ask whether newer models (o1, Claude 4, Gemini 2) and training harnesses (RL frameworks with richer state conditioning, mechanistic interpretability tools) have made the test *stricter* or *weaker*—does scaling make spurious consistency easier to detect, or easier to hide? For self-agreement as a trap, probe whether recent breakthroughs in constitutional AI or distillation have found ways to reward genuine internal coherence *without* self-amplification. For the confidence-preference decomposition, check whether latest reward models now explicitly separate them, or whether that remains undeployed theory. For the shape-of-reasoning trap, test whether recent reasoning models with longer inference budgets or explicit scratchpad supervision now actually evaluate constraints, or just produce more elaborate mimicry. Cite what changed each constraint's status, and flag what still appears true.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months** that challenges the counterfactual-invariance test or the preference/confidence split—work showing constructed preferences *can* pass invariance, or that invariance itself is an unreliable proxy.

(3) **Propose 2 research questions that ASSUME the regime may have moved:**
   - Does mechanistic interpretability (e.g., via representation engineering, arXiv:2310.01405) now let us *directly* read whether a preference is constructed or genuine, sidestepping behavioral tests?
   - If newer RL-fine-tuned models can learn to separate confidence from preference during training, can that be audited *before* deployment, and does it improve downstream robustness?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
