INQUIRING LINE

Random disagreement among raters cancels out, but bias in which examples get tested compounds and never averages away.

How much noise comes from rater idiosyncrasy versus selection bias?

This explores the two very different kinds of error in evaluation data: rater idiosyncrasy (random scatter from individual human judgment) versus selection bias (systematic distortion from which examples ever get seen or labeled). The corpus doesn't put a single number on the ratio, but it makes a sharper point: these two error sources behave so differently that lumping them together as "noise" hides the real problem. Idiosyncratic rater error is roughly uncorrelated across people, so it averages out under aggregation; selection bias is correlated and structural, so collecting more data from the same process does not remove it, and in a feedback loop it compounds.
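One way to make the "averages out versus does not" distinction concrete is a simple additive error model. This is an illustrative sketch, not something the corpus states: the symbols $b$ (a shared bias, treated as a fixed constant), $\varepsilon_i$ (rater $i$'s independent scatter), and $\sigma^2$ (its variance) are assumptions introduced here.

```latex
e_i = b + \varepsilon_i, \qquad
\mathbb{E}[\varepsilon_i] = 0, \quad
\operatorname{Var}(\varepsilon_i) = \sigma^2, \quad
\varepsilon_i \text{ independent across raters}

\bar{e}_n = \frac{1}{n}\sum_{i=1}^{n} e_i
          = b + \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i

\mathbb{E}\!\left[\bar{e}_n^{\,2}\right] = b^2 + \frac{\sigma^2}{n}
\;\xrightarrow[\;n \to \infty\;]{}\; b^2
```

Under these assumptions the idiosyncratic term shrinks as $1/n$ with more raters, while the shared term $b^2$ is untouched by aggregation.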

The clearest illustration of why idiosyncrasy is the *tractable* kind comes from work on training across many imperfect experts (see "Can models trained on many imperfect experts outperform each one?"). When you aggregate many raters or experts whose mistakes point in different directions, cross-entropy optimization effectively takes an implicit majority vote and denoises the uncorrelated individual errors, so the consensus can outperform any single rater. That only works because the errors are independent. The moment errors share a common cause, averaging stops helping. And a lot of what looks like "rater" variation is actually a shared prior: cognitive biases in models (and arguably in human labelers too) are planted upstream and merely nudged later (see "Where do cognitive biases in language models come from?"), meaning some apparent idiosyncrasy is really correlated bias in disguise.
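The difference between independent and shared errors can be seen in a small simulation. This is a toy sketch under stated assumptions, not the method from the cited work: the function name `majority_vote_accuracy`, the error rates, and the "every rater is wrong together" model of shared error are all hypothetical choices made for illustration.

```python
import random


def majority_vote_accuracy(n_raters, p_independent, p_shared=0.0,
                           n_items=5000, seed=0):
    """Fraction of items on which a strict majority of raters is correct.

    p_independent: chance each rater errs on an item, independently of others.
    p_shared:      chance an item triggers a common-cause error on which
                   every rater is wrong together (a fully correlated error).
    Use an odd n_raters; a tie is counted as a wrong majority.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    correct = 0
    for _ in range(n_items):
        # Shared error: all raters fail for the same reason, so the
        # majority is wrong no matter how many raters there are.
        if rng.random() < p_shared:
            continue
        # Independent errors: each rater flips their own coin.
        wrong = sum(rng.random() < p_independent for _ in range(n_raters))
        if 2 * wrong < n_raters:
            correct += 1
    return correct / n_items


if __name__ == "__main__":
    for n in (1, 15, 51):
        independent_only = majority_vote_accuracy(n, 0.3)
        with_shared = majority_vote_accuracy(n, 0.3, p_shared=0.15)
        print(n, round(independent_only, 3), round(with_shared, 3))
```

With only independent errors, accuracy climbs toward 1 as raters are added. With a 15% shared error rate in this toy setup, accuracy plateaus near 0.85 however many raters vote.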

Selection bias is the dangerous one because it doesn't wash out; it feeds back. YouTube's ranking work argues you have to model selection bias *explicitly*, with a dedicated mechanism, or the system converges on degenerate equilibria that amplify its own past decisions (see "Why do ranking systems need to model selection bias explicitly?"). The data you collect is shaped by what the model previously surfaced, so the bias isn't random scatter you can sample your way out of. It is a loop that gets stronger over time. No number of additional raters fixes a sampling process that systematically never shows you the cases where you're wrong.
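The feedback loop can be sketched with a deliberately simplified two-item example. Everything here is an assumption for illustration: the function `run_logging_loop`, the item names, the rates, and the noise-free feedback are hypothetical, and the periodic-exploration option is a generic stand-in for "a dedicated mechanism", not the position tower the YouTube work describes.

```python
def run_logging_loop(true_rate, initial_estimate, rounds, explore_every=0):
    """Greedy logging: only the item that gets shown receives feedback.

    true_rate:        dict of item -> true success rate (unknown to the system).
    initial_estimate: dict of item -> the system's starting belief.
    explore_every:    if > 0, every k-th round shows the least-shown item
                      instead of the current favourite (hypothetical remedy).
    Feedback is idealized as noise-free: the running estimate for a shown
    item is averaged toward its true rate.
    """
    estimate = dict(initial_estimate)
    shown = {item: 0 for item in true_rate}
    for step in range(1, rounds + 1):
        if explore_every and step % explore_every == 0:
            pick = min(shown, key=shown.get)        # structural exploration
        else:
            pick = max(estimate, key=estimate.get)  # exploit current belief
        shown[pick] += 1
        # Only the shown item is updated; the other item's belief is frozen.
        estimate[pick] += (true_rate[pick] - estimate[pick]) / shown[pick]
    return estimate, shown


if __name__ == "__main__":
    true_rate = {"A": 0.5, "B": 0.6}   # B is actually better
    start = {"A": 0.45, "B": 0.40}     # but the system starts out preferring A
    print(run_logging_loop(true_rate, start, rounds=1000))
    print(run_logging_loop(true_rate, start, rounds=1000, explore_every=10))
```

In the purely greedy run, item B is never shown, so its underestimate is never corrected regardless of how many rounds of feedback are collected. Changing what gets sampled, rather than collecting more of the same, is what breaks the loop in this sketch.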

This connects to a quieter failure mode: errors that *concentrate* rather than scatter. Fluent, confident, wrong answers cluster precisely in the rare cases where harm occurs, and aggregate accuracy masks them because overall performance still looks strong (see "Why do confident wrong answers hide in standard accuracy metrics?"). That's selection bias at the metric level: your evaluation set under-samples exactly the region where the model fails. Even your measurement of "reliability" can be fooled. A deterministic, zero-temperature output is perfectly consistent yet still just one draw from a distribution (see "Does setting temperature to zero actually make LLM outputs reliable?"), so low rater variance can give false comfort that the underlying judgment is sound.
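A few lines of arithmetic show how a strong aggregate number can hide a failing slice. The slice names and counts below are made up for illustration and are not taken from any cited study; `accuracy_report` is a hypothetical helper.

```python
def accuracy_report(slices):
    """Return (overall accuracy, per-slice accuracy).

    slices: dict of slice name -> (number of examples, number correct).
    """
    total = sum(n for n, _ in slices.values())
    correct = sum(c for _, c in slices.values())
    per_slice = {name: c / n for name, (n, c) in slices.items()}
    return correct / total, per_slice


if __name__ == "__main__":
    # Hypothetical evaluation set: the rare slice is 2% of the examples.
    slices = {
        "routine": (9800, 9700),
        "rare_high_harm": (200, 60),
    }
    overall, per_slice = accuracy_report(slices)
    print(f"overall: {overall:.3f}")
    for name, acc in per_slice.items():
        print(f"{name}: {acc:.3f}")
```

Here the overall figure is 9,760 correct out of 10,000, while the rare slice is right only 60 times out of 200. Reporting per-slice accuracy alongside the aggregate is one simple way to surface the gap.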

The practical upshot, if you're trying to clean up an evaluation pipeline: idiosyncratic rater noise is the cheap problem. Add raters, aggregate, denoise. Selection bias is the expensive one, and it has to be designed against structurally, not sampled against. Stronger judging machinery helps with consistency, and agentic evaluators with evidence collection cut judge instability dramatically (see "Can agents evaluate AI outputs more reliably than language models?"), but a more reliable judge applied to a biased sample just reliably measures the wrong thing. The point worth taking away: chasing rater agreement can make your numbers look better while the bias that actually matters sits untouched in what you chose to measure.

Sources (6 notes)

Can models trained on many imperfect experts outperform each one?

Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Why do confident wrong answers hide in standard accuracy metrics?

Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate (2.45 match)
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs (0.92 match)
Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge (0.88 match)
When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs (0.88 match)
Agent-as-a-Judge: Evaluate Agents with Agents (0.87 match)
Transcendence: Generative Models Can Outperform The Experts That Train Them (0.87 match)
Interactive Evaluation Requires a Design Science (0.86 match)
AgentCompass: A Unified Evaluation Infrastructure for Agent Capabilities (0.85 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an evaluation systems researcher. The question remains open: how much evaluation error comes from rater idiosyncrasy (random, individual judgment scatter) versus selection bias (systematic, structural distortion in what gets labeled)?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable claims.
- Idiosyncratic rater error is roughly uncorrelated across judges and averages out under aggregation; cross-entropy optimization acts as implicit majority voting, denoising independent errors (2024).
- Many apparent rater differences are actually correlated biases planted in pretraining and only nudged by finetuning, not truly idiosyncratic (2025).
- Selection bias is structural and does NOT wash out via sampling or more raters; it feeds back into ranking and evaluation loops, converging on degenerate equilibria (2024–2025).
- Fluent, confident wrong answers concentrate in rare high-harm regions, making them invisible to aggregate accuracy metrics—a form of metric-level selection bias (2024–2025).
- LLM-as-a-judge overconfidence masks unreliability; deterministic outputs create false sense of consistency, not true reliability (2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2406.11741 (2024) Transcendence: Generative Models outperform experts via aggregation.
- arXiv:2507.07186 (2025) Planted in Pretraining, Swayed by Finetuning: cognitive bias origins.
- arXiv:2508.06225 (2025) Overconfidence in LLM-as-a-Judge.
- arXiv:2605.28388 (2026) Sample Difficulty in RLVR.

Your task:
(1) RE-TEST EACH CONSTRAINT. For aggregation-denoises-idiosyncrasy and selection-bias-doesn't-average-out claims: has newer tooling (multi-agent orchestration, explicit bias modeling, adversarial sampling, or confidence-calibrated judges) since RELAXED either constraint? Separate the durable insight (selection bias is structural) from perishable claims about what current methods can or cannot do.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Any papers showing selection bias CAN be averaged out, or that idiosyncrasy is harder to denoise than the library claims?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can explicit multi-objective bias modeling + adaptive sampling together eliminate the feedback loop? (b) Do agentic evaluators with dynamic evidence collection finally crack metric-level selection bias, or do they just hide it deeper?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
