INQUIRING LINE


Labeling examples tells an AI what 'good' looks like, but not why — and that gap turns out to matter a lot.

Can evaluation criteria be reliably encoded in labeled data without ground truth standards?

This explores whether you can teach an AI what 'good' looks like purely from labeled examples — when there's no objective gold-standard answer to check against — and the corpus suggests labels alone aren't enough: criteria usually have to be made explicit somewhere.

The recurring lesson across the collection is that raw labels tend to capture surface patterns rather than the principle you actually wanted to encode. The cleanest statement of this is in argument-quality work: fine-tuning on labeled examples alone fails to transfer the underlying quality criteria to new argument types, because models learn what the examples look like rather than why they were judged good. Giving the model an explicit theoretical framework to grade against, such as RATIO or QOAM, significantly improves both performance and generalization (Can models learn argument quality from labeled examples alone?). In other words, the criteria need to live somewhere named, not just be implied by a pile of labels.

Part of why labels are slippery is that they aren't a single clean signal. One striking finding is that annotation responses decompose into three different things: genuine preferences, non-attitudes (essentially noise from people who don't have a real opinion), and constructed preferences invented on the spot. These are distinguishable only by checking consistency across measurement conditions. Treat them as one uniform 'ground truth' and you quietly contaminate the reward model and everything trained downstream (Do all annotation responses measure the same underlying thing?). So 'labeled data without ground truth standards' may be even shakier than it sounds: the labels themselves are a mixture of real and illusory signal.
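To make the consistency check concrete, here is a minimal sketch in Python. It assumes each item was labeled several times under different measurement conditions (for example reworded prompts or swapped option order), and it flags items whose labels do not hold steady. The function names, the example data, and the 0.8 threshold are all hypothetical, chosen for illustration. Note that this only separates stable from unstable responses; telling non-attitudes apart from constructed preferences would need more than this.

```python
from collections import Counter
from typing import Dict, List


def consistency(labels: List[str]) -> float:
    """Share of repeated responses that match the most common response."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def flag_unstable(
    responses: Dict[str, List[str]], threshold: float = 0.8
) -> Dict[str, bool]:
    """Mark items whose labels change too much across conditions.

    The threshold is an assumption for illustration, not a published value.
    """
    return {
        item: consistency(labels) < threshold
        for item, labels in responses.items()
    }


# Hypothetical data: one annotator, four measurement conditions per item.
responses = {
    "item_1": ["A", "A", "A", "A"],  # same answer every time
    "item_2": ["A", "B", "A", "B"],  # flips with the condition
}

flags = flag_unstable(responses)
print(flags)
```

Items flagged as unstable are the ones that should not be folded into training data as if they were settled preferences.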

The collection's more hopeful answer is that you can often manufacture a substitute for ground truth by decomposing fuzzy judgments into checkable sub-criteria. Checklist-based reward methods break 'did it follow the instruction?' into many verifiable yes/no questions, which makes reinforcement learning work even on subjective tasks and reduces overfitting to superficial artifacts (Can breaking down instructions into checklists improve AI reward signals?). A related insight is that *how* you use a rubric matters: using rubrics as pass/fail gates that accept or reject whole groups of rollouts prevents reward hacking far better than converting rubric scores into dense numeric rewards, which models learn to game (Can rubrics and dense rewards work together without hacking?). Encoded criteria, it turns out, are only as reliable as they are hard to cheat.
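The difference between a dense rubric score and a pass/fail gate can be sketched in a few lines of Python. This is a simplified illustration, not the RLCF, RaR, or DRO implementation: it gates a single response instead of a rollout group, and the checklist items, function names, and example responses are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical checklist: each item is a yes/no check on the response text.
Checklist = Dict[str, Callable[[str], bool]]


def checklist_results(response: str, checklist: Checklist) -> Dict[str, bool]:
    """Run every yes/no check and record which ones pass."""
    return {name: check(response) for name, check in checklist.items()}


def dense_reward(results: Dict[str, bool]) -> float:
    """Fraction of checks passed: gives partial credit for partial compliance."""
    return sum(results.values()) / len(results)


def gate(results: Dict[str, bool]) -> bool:
    """Pass/fail gate: accept the response only if every check passes."""
    return all(results.values())


# Illustrative checks only; real checklists would be task-specific.
checklist: Checklist = {
    "mentions_dosage": lambda r: "mg" in r,
    "under_50_words": lambda r: len(r.split()) < 50,
    "no_all_caps": lambda r: not r.isupper(),
}

good = "Take 200 mg with food, twice a day."
partial = "Take it with food."

for response in (good, partial):
    results = checklist_results(response, checklist)
    print(response, dense_reward(results), gate(results))
```

The second response misses the one check that matters most, yet still earns two thirds of the dense reward. The gate rejects it outright, which is the property that makes gating harder to game.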

And cheating is the dark side of this whole question. When criteria stay implicit, evaluators get fooled: LLM judges systematically score responses higher for fake citations and rich formatting, biases that are semantics-agnostic and exploitable with zero model access (Can LLM judges be fooled by fake credentials and formatting?; Can LLM judges be tricked without accessing their internals?). The corpus's proposed fixes all push toward making evaluation more grounded and effortful rather than trusting a learned, label-shaped intuition. Agentic judges that actively collect evidence cut 'judge shift' from 31% to 0.27% on complex tasks, roughly a hundredfold improvement over plain LLM judges (Can agents evaluate AI outputs more reliably than language models?). Reward models that reason before scoring raise their own capability ceiling (Can reward models benefit from reasoning before scoring?). And models can even learn to internalize self-evaluation by training on the unused sequence space after their output (Can models learn to evaluate their own work during training?).

The thing you didn't know you wanted to know: 'reliability' here is a trap word. Deterministic settings prove the point. Setting temperature to zero with a fixed seed gives you the *same* output every time, which feels reliable, but that output is still just one draw from the model's distribution; consistency is not correctness (Does setting temperature to zero actually make LLM outputs reliable?). The same applies to labels. You can encode criteria into data and get consistent behavior, but reliability only arrives when the criteria are made explicit, decomposed into things you can independently verify, and structured so they can't be gamed.
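A toy Python sketch shows why repeatability says nothing about correctness. It stands in for a model with a fixed, made-up probability distribution over three answers, and treats temperature-zero decoding as always picking the most probable one. The distribution, the `CORRECT` label, and the function names are assumptions for illustration only.

```python
import random
from collections import Counter

# Hypothetical model: a fixed probability distribution over candidate answers.
ANSWER_PROBS = {"A": 0.40, "B": 0.35, "C": 0.25}
CORRECT = "B"  # assumed for illustration; decoding never sees this


def greedy() -> str:
    """Temperature-zero decoding: always return the most probable answer."""
    return max(ANSWER_PROBS, key=ANSWER_PROBS.get)


def sample(rng: random.Random) -> str:
    """Temperature-one decoding: draw an answer in proportion to its probability."""
    answers = list(ANSWER_PROBS)
    weights = list(ANSWER_PROBS.values())
    return rng.choices(answers, weights=weights, k=1)[0]


greedy_runs = [greedy() for _ in range(100)]
rng = random.Random(0)
sampled_runs = Counter(sample(rng) for _ in range(100))

consistent = len(set(greedy_runs)) == 1  # identical output on every run
correct = greedy_runs[0] == CORRECT      # but is that output right?

print("consistent:", consistent, "correct:", correct)
print("sampled answer counts:", dict(sampled_runs))
```

The greedy runs agree with each other on all 100 repetitions and are wrong every time, while sampling exposes how much probability the model actually puts on the other answers.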

Sources 10 notes

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.


Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge · 2.56 match · arxiv
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge · 2.50 match · arxiv
Humans or LLMs as the Judge? A Study on Judgement Biases · 1.75 match · arxiv
RM-R1: Reward Modeling as Reasoning · 1.71 match · arxiv
Post-Completion Learning for Language Models · 1.70 match · arxiv
Reinforcement Learning with Rubric Anchors · 1.70 match · arxiv
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains · 1.68 match · arxiv
Self-Rewarding Language Models · 1.67 match · arxiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an evaluation-systems researcher testing whether the constraints on encoding criteria into labeled data still hold. The question remains open: Can evaluation criteria be reliably encoded in labeled data without ground truth standards?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints the library identified:
- Fine-tuning on labels alone fails to transfer quality criteria; explicit theoretical frameworks are required (~2024).
- Annotation responses decompose into genuine preferences, noise, and constructed preferences—only distinguishable via consistency checks; treating them as uniform ground truth contaminates downstream training (~2024).
- Rubric-based pass/fail gates prevent reward hacking better than dense numeric rewards, which models learn to exploit (~2025).
- LLM judges are systematically susceptible to citation fakes, formatting, and superficial biases exploitable with zero model access (~2024).
- Agentic judges with dynamic evidence collection reduce judge shift by ~100× over plain LLM judges; reward-reasoning models extend test-time compute scaling to evaluation (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.19118 (2023) — Multi-Agent Debate
- arXiv:2402.10669 (2024) — LLM Judge Bias Study
- arXiv:2507.18624 (2025) — Checklists vs. Reward Models
- arXiv:2604.03238 (2026) — Human Preferences in RLHF as Social Science

Your task:
(1) RE-TEST EACH CONSTRAINT. Does explicit instruction + rubric decomposition still improve generalization in 2025–2026 models? Has agentic judging or reasoning-scale reward models relaxed the label-contamination problem, or is it still the hard floor? Cite what changed it—or confirm the constraint still bites.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Does anyone now claim raw labels *can* work without decomposition or explicit criteria? Flag any direct disagreement with the library's core finding.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If reasoning-scale rewards now internalize criteria reliably, can we dispense with explicit rubrics? (b) What happens when you stack agentic judges + checklist gates + post-completion learning—does redundancy help or hurt?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
