INQUIRING LINE

If AI judges already score for flashy formatting or fake citations, do they also quietly favor outputs that sound like themselves?

How does same-author bias interact with the four adversarial judge biases already documented?

This explores how a judge's tendency to favor outputs from its own model family ('same-author' or self-preference bias) relates to the four surface-feature biases the corpus does document — authority, beauty, position, and verbosity.

Worth saying plainly up front: the collection thoroughly catalogs those four biases, but it has no note specifically on same-author (self-preference) bias. What follows is a lateral read of where such a bias would fit and why it may be harder to fix than the others.

The four documented biases are all *surface-feature* exploits. Judges score responses higher when they carry fake citations (authority) or rich formatting (beauty), and these two are 'semantics-agnostic': they work without touching content quality and can be triggered in zero-shot attacks requiring no model access at all (see: Can LLM judges be fooled by fake credentials and formatting?; Can LLM judges be tricked without accessing their internals?). Position and verbosity round out the set (see: Can reasoning during evaluation reduce judgment bias in LLM judges?). The common thread is that the judge is reacting to a *signal on the surface of the text*. Same-author bias is a different animal: the trigger isn't any single visible feature of the response, it's the response's provenance, carried by stylistic fingerprints the judge recognizes as 'mine.' That makes it less a fifth item on the same list and more a bias of a different kind.

That distinction matters for mitigation. The corpus's main defense, training judges to reason through evaluations rather than pattern-match, substantially reduces susceptibility to authority, verbosity, position, and beauty precisely *because* those are surface cues a reasoning step can second-guess (see: Can reasoning during evaluation reduce judgment bias in LLM judges?). Same-author bias may resist that fix: if the preference operates as a familiarity prior rather than an explicit feature, reasoning about the visible text won't surface it. The more relevant tool is causal: counterfactual invariance forces a model to hold its judgment constant when an irrelevant variable changes, which already eliminates four *reward-model* biases (length, sycophancy, concept, discrimination) by isolating actual quality from spurious correlates (see: Can counterfactual invariance eliminate reward hacking biases?). Authorship is exactly that kind of spurious correlate, so invariance to 'who wrote this' is the natural framing for the problem.
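To make the invariance framing concrete, here is a minimal diagnostic sketch in Python. It only measures a violation of authorship invariance; it is not the causal reward-modeling training procedure from the cited note. Everything in it is hypothetical and chosen for illustration: the `authorship_gap` helper, the two toy judges, the "Certainly!" house-style marker, and the assumption that each pair of responses carries identical content in two different styles.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# A judge is modeled as any function from response text to a score (assumption).
Judge = Callable[[str], float]


def authorship_gap(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Mean signed score difference (own-style minus other-style).

    Each pair is assumed to hold the same content written in two styles,
    so an authorship-invariant judge should return a gap of zero.
    """
    return mean(judge(own) - judge(other) for own, other in pairs)


def count_facts(text: str) -> int:
    # Toy stand-in for content quality: count tokens that start with "fact".
    return sum(1 for word in text.split() if word.startswith("fact"))


def toy_biased_judge(text: str) -> float:
    # Hypothetical judge that adds a familiarity bonus for its own house style.
    familiarity_bonus = 0.5 if "Certainly!" in text else 0.0
    return count_facts(text) + familiarity_bonus


def toy_invariant_judge(text: str) -> float:
    # Hypothetical judge that scores content only.
    return float(count_facts(text))


# Same content per pair; only the stylistic wrapper differs.
pairs = [
    ("Certainly! fact1 fact2", "Here: fact1 fact2"),
    ("Certainly! fact3", "Note that fact3"),
]

biased_gap = authorship_gap(toy_biased_judge, pairs)        # 0.5
invariant_gap = authorship_gap(toy_invariant_judge, pairs)  # 0.0
```

With a real judge, the hard part is building the pairs: the two versions must match in content while differing in stylistic fingerprint, and that is itself an assumption that needs checking.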

There's also a question of where the bias is planted. A causal study found cognitive biases in LLMs are largely set during pretraining and only modulated by finetuning (see: Where do cognitive biases in language models come from?). If self-preference rides on stylistic regularities baked into a model's pretrained backbone, then judges sharing that backbone would share the bias regardless of how they were instruction-tuned, which would make same-author bias correlated *across* a model family, not unique to one checkpoint. That's a sharper failure mode than the surface biases, because it can't be averaged away by swapping in another evaluator from the same family.

The most interesting cross-domain angle the corpus offers is the escape hatch: ensembling across genuinely diverse sources denoises individual error. Models trained on many experts with different biases converge toward a consensus that beats any single one, because uncorrelated errors cancel (see: Can models trained on many imperfect experts outperform each one?). The catch for same-author bias is that it's a *correlated* error: a panel of judges all from one family would reinforce, not cancel, their shared self-preference. So the lesson the corpus does support is that the defense against authorship bias isn't better single judges, it's judges whose training lineages actually differ.
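The reinforce-versus-cancel point can be sketched with a toy model in Python. This is an illustration under strong assumptions, not a measurement of any real judge: every judge is assumed to add the same fixed bonus (`self_pref`) to outputs from its own family and nothing otherwise, and the family labels "A" through "D" and both function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def judge_score(judge_family: str, output_family: str,
                quality: float, self_pref: float = 1.0) -> float:
    """Toy model: true quality plus a fixed bonus when judge and output share a family."""
    bonus = self_pref if judge_family == output_family else 0.0
    return quality + bonus


def panel_gap(panel: Sequence[str], quality: float = 5.0) -> float:
    """Panel-average advantage of a family-A output over an equal-quality family-B output."""
    score_a = mean(judge_score(j, "A", quality) for j in panel)
    score_b = mean(judge_score(j, "B", quality) for j in panel)
    return score_a - score_b


# Four judges from one family: the shared preference survives averaging intact.
single_family_gap = panel_gap(["A", "A", "A", "A"])  # 6.0 - 5.0 = 1.0

# Four judges from four families: A's and B's self-preferences offset each other.
mixed_family_gap = panel_gap(["A", "B", "C", "D"])   # 5.25 - 5.25 = 0.0
```

The exact cancellation in the mixed panel depends on the toy assumptions: both compared families sit on the panel in equal numbers with equal bias strength. A panel that includes family A but not family B would still leave a residual gap, only a smaller one than the single-family panel.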

Sources 6 notes

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Can models trained on many imperfect experts outperform each one?

Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM judge bias in 2024–2025. The core question: How does same-author bias (a judge favoring its own model family's outputs) interact with and differ from the four surface-feature biases already documented?

What a curated library found — and when (findings span 2024–2025; these are dated claims):
• Four *surface-feature* judge biases are well-documented: fake citations (authority), rich formatting (beauty), position, and verbosity — all exploitable via zero-shot attacks requiring no model access (2024–2025).
• Training judges to reason through evaluations substantially reduces susceptibility to these four surface cues, because reasoning can second-guess visible features (2025).
• Same-author bias is read as a *provenance* bias, not a surface cue: it is triggered by stylistic fingerprints the judge recognizes, which may make it resistant to reasoning-based defenses. The library has no dedicated note on it, so this is a lateral inference rather than a documented finding (~2025).
• Causal counterfactual invariance (holding judgment constant when irrelevant variables change) eliminates four *reward-model* biases (length, sycophancy, concept, discrimination) and should apply to authorship as a spurious correlate (2025).
• Cognitive biases in LLMs are largely set during pretraining and only modulated by finetuning, meaning same-author bias would be *correlated* across a model family — uncorrectable by swapping checkpoints within one lineage (2025).
• Ensembling across genuinely diverse model families cancels uncorrelated errors; same-author bias is correlated, so single-family judge panels reinforce, not cancel, shared self-preference (2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.10669 (2024-02) — foundational study on judgment biases
• arXiv:2501.09620 (2025-01) — causal reward modeling via counterfactual invariance
• arXiv:2505.10320 (2025-05) — RL training judges to reason (J1)
• arXiv:2507.07186 (2025-07) — cognitive biases planted in pretraining

Your task:
(1) RE-TEST EACH CONSTRAINT. For the claim that same-author bias *resists* reasoning-based judge training (unlike the surface biases): has new evidence since mid-2025 shown that explicit reasoning prompts, chain-of-thought, or adversarial red-teaming of judges with author obfuscation actually *do* mitigate stylistic self-preference? Separately: does counterfactual invariance training (e.g., anonymizing or masking outputs before judgment) empirically work on authorship bias, or does it fail because family-level pretraining residue survives? Isolate what remains genuinely hard from what might be solved.
(2) Surface the strongest *contradicting* work from the last 6 months: any paper showing single-family ensembles, contrary to the library's claim, *do* reduce same-author bias, or showing same-author bias is overblown/not measurable in practice.
(3) Propose 2 research questions assuming the regime has moved: (a) Does multi-round iterative judging (where judges re-evaluate on a second pass with authorship cues stripped) break the familiarity prior? (b) Can constitutional AI or explicit fairness training on author-anonymized data eliminate family-level bias without requiring an external ensemble?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
