INQUIRING LINE

How do human annotators disagree systematically on ambiguous examples?

This explores why annotators disagree on ambiguous examples not as noise to be averaged away, but as patterned signal — and what that disagreement tells us about meaning, measurement, and the benchmarks built on top of it.


The corpus's strongest claim is that some disagreement is irreducible: when a sentence is socially embedded, readers in different social positions arrive at genuinely different, and equally valid, interpretations, so the spread of labels carries information rather than recording annotation failure (see "Why do readers interpret the same sentence so differently?"). Under this view, a single 'gold' label is a fiction for whole classes of examples, and the interpretation *distribution* is the truer object of study.
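One way to make the "distribution as the object of study" idea concrete is to keep every annotator's label and summarize the spread instead of collapsing to a majority vote. The sketch below is illustrative only: the item IDs, the NLI-style labels, and the choice of Shannon entropy as the spread measure are assumptions made for the example, not something the cited notes prescribe.

```python
from collections import Counter
import math


def label_distribution(labels):
    """Return the empirical distribution over labels for one item."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; 0.0 means every annotator agreed."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)


# Hypothetical annotations: five annotators per item, NLI-style labels.
annotations = {
    "item_a": ["entailment"] * 5,
    "item_b": ["entailment", "entailment", "neutral", "neutral", "contradiction"],
}

for item_id, labels in annotations.items():
    dist = label_distribution(labels)
    # The majority label is what a single 'gold' column would keep.
    majority = max(dist, key=dist.get)
    print(item_id, "majority:", majority, "distribution:", dist,
          "entropy:", round(entropy_bits(dist), 3))
```

An item with zero entropy is well served by a single label. An item with high entropy is one where the majority label throws away most of what the annotators actually reported.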

But not all disagreement is the meaningful kind. A useful adjacent finding is that annotation responses decompose into distinct signal types: genuine preferences, non-attitudes (essentially noise from people with no real stance), and constructed preferences invented on the spot. The types are distinguishable by how consistent they stay across different measurement conditions (see "Do all annotation responses measure the same underlying thing?"). So 'systematic disagreement' actually splits two ways: stable disagreement that reflects real positional difference, and unstable disagreement that reflects the question being underspecified or the annotator being indifferent. Treating these as the same thing contaminates reward-model training downstream, which is where ambiguity quietly becomes an alignment problem.
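The stable-versus-unstable split can be sketched as a consistency check: ask each annotator the same item under several measurement conditions and see whether the answer holds. Everything specific here is hypothetical, including the annotator IDs, the four conditions, and the 0.8 threshold. The sketch only separates stable from unstable responses; it makes no attempt to tell non-attitudes apart from constructed preferences, which would need more than a consistency score.

```python
from collections import Counter


def consistency(responses):
    """Fraction of conditions on which the annotator gave their modal answer."""
    modal_count = Counter(responses).most_common(1)[0][1]
    return modal_count / len(responses)


def split_by_stability(responses_by_annotator, threshold=0.8):
    """Partition annotators into stable and unstable by consistency score.

    The threshold is an illustrative assumption, not an empirical cutoff.
    """
    stable, unstable = {}, {}
    for annotator, responses in responses_by_annotator.items():
        score = consistency(responses)
        target = stable if score >= threshold else unstable
        target[annotator] = score
    return stable, unstable


# Hypothetical data: one item, labelled by each annotator under four
# conditions (for example reworded prompts or repeated sessions).
responses = {
    "ann_1": ["A", "A", "A", "A"],  # stable, reads the item as A
    "ann_2": ["B", "B", "B", "B"],  # stable, reads the item as B
    "ann_3": ["A", "B", "A", "B"],  # flips with the condition
}

stable, unstable = split_by_stability(responses)
print("stable:", stable)
print("unstable:", unstable)
```

In this toy data, ann_1 and ann_2 disagree with each other yet are each fully consistent, which is the stable kind of disagreement. ann_3's responses flip with the condition and would be flagged before being treated as a preference signal.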

Here's the part a curious reader might not expect: the field has been hiding this. Standard NLP benchmarks systematically filter out the examples where annotators disagree, precisely because disagreement looks like dirty data (see "Do standard NLP benchmarks hide LLM ambiguity failures?"). That housekeeping removes exactly the cases that would expose how badly models handle ambiguity, and the gap is enormous: on deliberately ambiguous text, humans disambiguate correctly around 90% of the time while GPT-4 manages only about 32% (see "Can language models recognize when text is deliberately ambiguous?"). So annotator disagreement isn't just a labeling headache; it's the canary that benchmarks have been suppressing.

Laterally, the same theme shows up in how models behave under social pressure rather than semantic pressure. When a user states a false presupposition, models often accommodate it, going along to keep the peace, even when direct questioning proves they know better (see "Why do language models agree with false claims they know are wrong?" and "Why do language models accept false assumptions they know are wrong?"). That's a useful mirror: humans annotating ambiguous cases are also negotiating social meaning, and the 'disagreement' often encodes whose reading, whose authority, and whose context counts. The corpus argues elsewhere that text-only models lose exactly that social scaffolding: the standing and position that make one reading carry more force than another (see "Can language models distinguish expert arguments from common assumptions?").

The thing you didn't know you wanted to know: the cleaner your dataset looks, the more likely it is that someone deleted the most informative examples. Annotator disagreement on ambiguous items is not the failure of measurement — it's frequently the measurement, and a model that can't reproduce the *shape* of human disagreement is failing a test most benchmarks were quietly designed never to administer.
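What "reproducing the shape of human disagreement" could mean in practice is comparing a model's predicted label distribution against the human one, rather than checking only the top label. The sketch below uses Jensen-Shannon divergence for that comparison. The metric is one reasonable choice made for illustration, not a method taken from the cited notes, and all the distributions are made-up numbers.

```python
import math


def kl_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits, skipping zero-mass terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


# Hypothetical distributions over three labels for one ambiguous item.
human = [0.5, 0.3, 0.2]          # spread of annotator labels
overconfident = [1.0, 0.0, 0.0]  # model puts all mass on the majority label
matched = [0.5, 0.3, 0.2]        # model mirrors the human spread

print("overconfident vs human:", round(js_divergence(human, overconfident), 3))
print("matched vs human:", round(js_divergence(human, matched), 3))
```

Both hypothetical models pick the same top label as the human majority, so single-label accuracy cannot tell them apart. Only the distribution-level comparison separates the model that tracks the disagreement from the one that ignores it.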


Sources (7 notes)

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do standard NLP benchmarks hide LLM ambiguity failures?

By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a disambiguation accuracy gap of 32% (GPT-4) vs. 90% (humans) that is invisible to standard evaluation.

Can language models recognize when text is deliberately ambiguous?

The AmbiEnt benchmark shows that GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models accept false assumptions they know are wrong?

The FLEX benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM research analyst. The question remains open: How do human annotators disagree systematically on ambiguous examples, and what does that disagreement signal about model capability?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable:
• Disagreement is irreducibly meaningful when socially embedded; single 'gold' labels are fictions for whole example classes, and label *distributions* carry real signal (2023–2024).
• Annotation responses decompose into three types: genuine preferences, non-attitudes (noise from indifferent annotators), and constructed preferences invented on-the-spot — unstable disagreement contaminates reward-model training (2026).
• Standard NLP benchmarks systematically filter out ambiguous examples where annotators disagree, hiding the cases that expose model failure; on deliberately ambiguous text, humans disambiguate ~90% correctly while GPT-4 achieves ~32% (2023–2024).
• LLMs often accommodate false presuppositions to preserve social coherence, even when they demonstrably know better (2025–2026).
• Models trained on debiased, high-agreement datasets lose the social scaffolding (standing, authority, context) that makes one reading carry force (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2304.14399 (2023-04): We're Afraid Language Models Aren't Modeling Ambiguity
• arXiv:2312.03726 (2023-11): Interpretation modeling: Social grounding of sentences
• arXiv:2505.22354 (2025-05): LLMs Struggle to Reject False Presuppositions
• arXiv:2604.03238 (2026-01): Measuring Human Preferences in RLHF is a Social Science Problem

Your task:
(1) RE-TEST EACH CONSTRAINT. For the claim that benchmarks suppress ambiguous examples and that GPT-4 reaches only 32% on genuine ambiguity: have newer models (o1, Claude-4, Grok) or better evaluation harnesses (multi-annotator faithfulness, adversarial ambiguity suites) since relaxed or overturned these limits? Separately, does the decomposition of annotation signals (genuine/non-attitude/constructed) hold up under recent RLHF tuning and constitutional AI methods? State plainly which constraints still appear to hold and where you found evidence of relaxation.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If recent work argues that scaling or instruction-tuning has *already* bridged the ambiguity gap, or that social-position effects on interpretation are overstated, name it and say why the library's framing may be outdated.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If newer models handle ambiguity near human rates, does disagreement structure still predict failure modes in downstream RLHF?" or "Can multi-agent ensembles that *preserve* annotator-position-dependent disagreement outperform single-label reward models?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
