INQUIRING LINE

Why do high-disagreement tasks benefit from broad rater pools over deep annotation?

This explores why tasks where annotators legitimately disagree are better served by sampling many different raters (breadth) than by having a few raters label intensively (depth).


This reads the question as being about *where the signal actually lives* when a task provokes disagreement. The corpus suggests the answer is breadth: on high-disagreement tasks the disagreement itself is the data, not noise to be averaged away. The starting point is the note "Why do readers interpret the same sentence so differently?": when sentences are socially embedded, different readers interpret them differently because of where they sit, not because someone made a mistake. If the variation across people is the real signal, then deep annotation by a narrow set of raters can't recover it; you would be sampling one or two perspectives very precisely while missing the distribution entirely. Breadth captures the shape of legitimate disagreement; depth sharpens a single, possibly idiosyncratic, point of view.
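The breadth-versus-depth trade-off can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a result from the corpus: each simulated rater holds one stable reading of an item, 60% of the population reads it as "A", and raters slip to the other label 5% of the time. The function names, the 60/40 split, and the slip rate are all hypothetical. Both designs spend the same budget of 100 labels.

```python
import random

def estimated_share(n_raters, labels_per_rater, rng, share_a=0.6, slip=0.05):
    """Estimate the population share of reading 'A' from one labeling run."""
    labels = []
    for _ in range(n_raters):
        # Assumption: each rater has one stable perspective on the item.
        reads_a = rng.random() < share_a
        for _ in range(labels_per_rater):
            # Assumption: a rater occasionally gives the other label.
            slipped = rng.random() < slip
            labels.append(reads_a != slipped)
    return sum(labels) / len(labels)

def mean_abs_error(n_raters, labels_per_rater, trials=2000, share_a=0.6, seed=0):
    """Average distance between the estimate and the true share of 'A' readers."""
    rng = random.Random(seed)
    errors = [
        abs(estimated_share(n_raters, labels_per_rater, rng, share_a) - share_a)
        for _ in range(trials)
    ]
    return sum(errors) / trials

if __name__ == "__main__":
    # Same budget (100 labels), spent two different ways.
    print("depth   (2 raters x 50 labels):", round(mean_abs_error(2, 50), 3))
    print("breadth (100 raters x 1 label):", round(mean_abs_error(100, 1), 3))
```

Under these assumptions the deep design pins down two raters' readings very precisely, but its estimate of the population split depends almost entirely on which two raters happened to be drawn.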

There's a second reason breadth matters, and it's about what a single annotator's response even *is*. The note "Do all annotation responses measure the same underlying thing?" shows that annotations aren't one clean measurement: they mix genuine preferences, non-attitudes (people answering when they have no real opinion), and constructed preferences (opinions invented on the spot). You can only tell these apart by looking across measurement conditions and across people. Annotating one rater deeply gives you consistency, but consistency alone can't distinguish a stable genuine preference from a stably constructed artifact. A broad pool lets the genuine signal accumulate while the non-attitudes and constructions wash out as scatter, which is exactly the separation deep annotation can't perform.
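One way to make "looking across measurement conditions" concrete is a rough consistency check per rater. The sketch below is a hypothetical heuristic, not a method described in the source notes: the function names, the `framings` input (for example, two wordings of the same question), the 0.9 threshold, and the three bucket names are all assumptions for illustration.

```python
from collections import Counter

def consistency(responses):
    """Share of a rater's responses that match their most common answer."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)

def tracks_framing(responses, framings):
    """True if answers are constant within each framing but differ across framings."""
    by_framing = {}
    for answer, framing in zip(responses, framings):
        by_framing.setdefault(framing, set()).add(answer)
    constant_within = all(len(answers) == 1 for answers in by_framing.values())
    distinct_across = len({next(iter(a)) for a in by_framing.values()}) > 1
    return constant_within and distinct_across

def rough_type(responses, framings, threshold=0.9):
    """Sort one rater into a hypothetical bucket based on response stability."""
    if consistency(responses) >= threshold:
        # Stable across conditions: a candidate genuine preference, but it could
        # also be a stably constructed one; this check cannot tell them apart.
        return "stable"
    if tracks_framing(responses, framings):
        # Answer follows the wording: a candidate constructed preference.
        return "framing-dependent"
    # No stable pattern: a candidate non-attitude.
    return "unstable"
```

For example, `rough_type(["A", "B", "A", "B"], ["pos", "neg", "pos", "neg"])` lands in the "framing-dependent" bucket. The "stable" bucket is deliberately ambiguous, which mirrors the point above: within-rater consistency needs to be compared across people before it says anything about genuine preference.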

The failure mode of ignoring this shows up downstream in the note "Why do reasoning models fail at predicting disagreement?": RLVR-trained models, optimized for a single deterministic "correct" answer, get *worse* at predicting human disagreement distributions, and worst of all precisely where variance is high. That's the modeling-side mirror of narrow annotation: collapsing many valid views into one erodes the very capability you need on contested tasks. Broad rater pools are the data-collection counterpart to keeping that distribution alive instead of training it away.
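What "collapsing many valid views into one" costs can be sketched by comparing the empirical label distribution for an item with its majority-vote target. This is an illustrative sketch, not the evaluation used in the cited work; the function names are hypothetical, and total variation distance is chosen here only because it is simple to read (0 means nothing is lost, larger values mean more of the distribution is discarded).

```python
from collections import Counter

def label_distribution(labels):
    """Empirical distribution over the labels raters gave to one item."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def majority_one_hot(labels):
    """Single 'correct' answer: all probability mass on the most common label."""
    winner = Counter(labels).most_common(1)[0][0]
    return {winner: 1.0}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def collapse_loss(labels):
    """How far the majority-vote target sits from what raters actually said."""
    return total_variation(label_distribution(labels), majority_one_hot(labels))

if __name__ == "__main__":
    consensus = ["entailment"] * 10                      # hypothetical item
    contested = ["entailment"] * 6 + ["neutral"] * 4     # hypothetical item
    print("consensus item:", collapse_loss(consensus))
    print("contested item:", collapse_loss(contested))
```

On a unanimous item the majority target loses nothing; on a contested item it discards the whole minority share, so the loss grows exactly where variance is high.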

The quiet payoff is a reframing: on contested tasks, "more annotation per rater" and "better annotation" point in opposite directions. Depth buys you precision about the wrong quantity, one perspective's certainty, when the quantity you actually need is the spread across perspectives. The corpus also hints this isn't limited to subjective social text: the note "Can models learn argument quality from labeled examples alone?" finds that models fine-tuned on labeled examples learn surface patterns rather than principled criteria unless they are given an explicit framework. By analogy, raters without a shared framework can do the same, which is another way disagreement leaks in, and another reason a single deep annotator can quietly encode their own idiosyncratic surface rules rather than a criterion anyone else would share.


Sources (4 notes)

Why do readers interpret the same sentence so differently?

Interpretation Modeling research shows that disagreement on socially embedded sentences reflects valid differences in reader perspective, not annotation failure. Structured human disagreement in NLI benchmarks confirms that interpretation distributions carry meaningful information.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Why do reasoning models fail at predicting disagreement?

RLVR-trained models degrade significantly at predicting human disagreement distributions, especially when variance is high. The optimization signal for deterministic correctness actively erodes the model's ability to represent multiple valid interpretations.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing a claim about annotation design: *high-disagreement tasks benefit from broad rater pools over deep annotation because disagreement itself is the signal, not noise*. This claim rests on work from 2023–2026 in interpretation modeling, annotation theory, and RLHF. Treat those findings as dated; your job is to test whether they still constrain practice.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026:
• Annotations decompose into genuine preferences, non-attitudes, and constructed preferences, distinguishable only by comparing across measurement conditions and across people (2023–2024).
• Models optimized for single deterministic answers *degrade* at predicting high-variance human disagreement, especially in contested domains (2024–2025).
• Without explicit shared frameworks, models fine-tuned on labeled examples learn surface patterns rather than principled criteria; by analogy, a single deep annotator risks encoding idiosyncratic surface rules, making deep annotation dangerously narrow (2024).
• Recent work frames measuring human preferences in RLHF as a *social science problem*, implying breadth and context matter (2026).
• High-entropy minority tokens in RL data drive effective training more than consensus tokens (2025).

Anchor papers (verify; mind their dates):
• arXiv:2312.03726 (2023) — Interpretation modeling: social grounding of sentences.
• arXiv:2506.19467 (2025) — Can LLMs capture human annotator disagreement?
• arXiv:2604.03238 (2026) — Measuring human preferences in RLHF is a social science problem.
• arXiv:2506.01939 (2025) — High-entropy minority tokens drive RL effectiveness.

Your task:
(1) RE-TEST the claim that *deep annotation fails* on disagreement-heavy tasks. Probe: Have newer RLHF methods (post-completion learning, self-feedback, reinforcement from uncertainty) changed whether narrow annotators with explicit training can reliably capture the disagreement distribution? Does adaptive retrieval or multi-agent reasoning now let a single expert simulate breadth? Separate the durable insight (disagreement is signal) from the perishable constraint (you must use many raters to recover it).
(2) Surface the strongest work contradicting the breadth-over-depth thesis. The 2026 paper on RLHF as social science may already complicate it; flag any recent result showing depth + meta-annotation (annotating agreement itself) can rival breadth.
(3) Propose two questions assuming the regime has moved: (a) Can *uncertainty-aware* single annotators + mechanistic interpretability of their own disagreement with themselves replace population-level breadth? (b) Does fine-tuning LLMs to *reason about disagreement structures* (rather than averaging them) make shallow annotation viable for high-variance tasks?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
