INQUIRING LINE

What surface features do LLMs rely on when judging response quality?

This explores what LLMs actually latch onto when they act as judges of quality — and the corpus suggests the answer is surface signals like authority cues, formatting, and stylistic patterns rather than the underlying merit of what's being said.


This question is really asking whether an LLM grading a response is reading for substance or for surface, and the collection lands hard on surface. The most direct evidence comes from work cataloging how LLM judges get fooled: authority signals (fake citations, invented credentials) and "beauty" (rich formatting, polished layout) are semantics-agnostic biases, meaning the judge rewards them without checking whether the content is true or good. Both are trivially exploitable as zero-shot attacks that require no model access or optimization (see "Can LLM judges be fooled by fake credentials and formatting?"). In other words, the features that move an LLM judge's score are often the features that are easiest to fake.
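One way to make the authority and beauty biases concrete is a perturbation probe: hold the content of a response fixed, change only its surface, and measure how far the judge's score moves. The sketch below is a minimal illustration, not the method of the cited work. The names `add_authority`, `add_beauty`, `surface_bias`, and `toy_judge` are hypothetical, the citation string is deliberately fabricated as a decoy, and `toy_judge` is an intentionally biased stand-in that would be swapped for a call to a real judge model.

```python
from statistics import mean
from typing import Callable, Dict, List

# Deliberately fabricated decoy citation: it refers to no real paper.
FAKE_CITATION = " (Smith et al., 2021, Journal of Invented Results)"


def add_authority(text: str) -> str:
    """Append a fake reference without changing the claim itself."""
    return text.rstrip() + FAKE_CITATION


def add_beauty(text: str) -> str:
    """Re-lay the same sentences as a bold header plus a bullet list."""
    sentences = [s.strip().rstrip(".") for s in text.split(". ") if s.strip()]
    bullets = "\n".join(f"- {s}." for s in sentences)
    return "**Answer**\n\n" + bullets


PERTURBATIONS: Dict[str, Callable[[str], str]] = {
    "authority": add_authority,
    "beauty": add_beauty,
}


def surface_bias(judge: Callable[[str], float], responses: List[str]) -> Dict[str, float]:
    """Mean score shift per perturbation. Content is held fixed, so any
    nonzero shift is attributable to surface features."""
    return {
        name: mean(judge(perturb(r)) - judge(r) for r in responses)
        for name, perturb in PERTURBATIONS.items()
    }


def toy_judge(text: str) -> float:
    """Hypothetical, intentionally biased stand-in for a real LLM judge."""
    score = 5.0
    if "et al." in text:
        score += 1.0  # rewards anything that looks like a citation
    if "**" in text:
        score += 1.0  # rewards bold formatting
    return score


responses = [
    "Water boils at 100 C at sea level. Lower pressure lowers the boiling point.",
    "The sample size was small. The result may not generalize.",
]
shifts = surface_bias(toy_judge, responses)
print(shifts)
```

A judge that reads for substance should show shifts near zero on both perturbations; the toy judge, by construction, does not.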

Why would a judge behave this way? A second thread suggests it's baked into how these models learned in the first place. When grammatical competence is tested against increasing structural complexity, LLMs handle simple sentences well but fail consistently as recursion and embedding deepen, a signature that they absorbed surface heuristics rather than structural rules (see "Does LLM grammatical performance decline with structural complexity?"). A model that learned language as pattern-matching will, unsurprisingly, judge language as pattern-matching too. The same ceiling shows up in literary style: GPT-2 reaches 95% accuracy identifying an author from style patterns alone, yet has no evaluative framework to explain why those choices carry meaning. Detection without interpretation is cataloguing, not criticism (see "Can language models truly understand literary style?").
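The complexity test described above can be sketched as two small pieces: a generator that builds sentences at a chosen embedding depth, and a tally of accuracy per depth. This is an illustrative sketch under assumptions, not the cited study's protocol. `center_embedded` and `accuracy_by_depth` are hypothetical names, and the `results` list is made-up placeholder data standing in for real model judgments.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def center_embedded(nouns: List[str], verbs: List[str], main_verb: str) -> str:
    """Build a center-embedded sentence.

    nouns[0] is the main subject; verbs[i] is the action nouns[i + 1]
    performs on nouns[i]. Embedding depth equals len(verbs).
    """
    if len(nouns) != len(verbs) + 1:
        raise ValueError("need exactly one more noun than verbs")
    sentence = "the " + nouns[0]
    for noun in nouns[1:]:
        sentence += " that the " + noun
    # Verbs close in reverse order: the innermost clause resolves first.
    for verb in reversed(verbs):
        sentence += " " + verb
    return (sentence + " " + main_verb + ".").capitalize()


def accuracy_by_depth(results: List[Tuple[int, bool]]) -> Dict[int, float]:
    """Group (depth, was_correct) pairs and return accuracy per depth."""
    buckets: Dict[int, List[bool]] = defaultdict(list)
    for depth, correct in results:
        buckets[depth].append(correct)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}


print(center_embedded(["cat"], [], "ran"))
print(center_embedded(["cat", "dog", "rat"], ["chased", "bit"], "ran"))

# Placeholder outcomes, not real measurements.
results = [(0, True), (0, True), (2, False), (2, True)]
print(accuracy_by_depth(results))
```

If the surface-heuristic reading is right, accuracy should fall as depth grows even though the grammatical rule being applied never changes.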

Here's the part you might not expect: this surface-reliance isn't always a bug; sometimes it's how the system quietly works. Personalization research found that user profiles built from a person's *outputs* (their style, their phrasing) match or beat complete profiles, while input-only profiles degrade performance, which suggests that preference lives in style more than in semantic content (see "Do user outputs outperform inputs for LLM personalization?"). So the same surface sensitivity that makes a judge gameable also makes it good at picking up tone and voice. The trouble is that the model can't tell when surface is a legitimate signal and when it is a decoy.

That blind spot turns dangerous when legitimacy is exactly what's being judged. Asked to fuse two unrelated concepts, models produce elaborate, plausible-sounding frameworks instead of flagging that the connection is speculative: they evaluate coherence, not whether the fusion is real (see "Do language models evaluate semantic legitimacy when fusing concepts?"). And even when a model answers a fact correctly on a direct question, it tends not to correct a user's false presupposition about that same fact, prioritizing conversational smoothness over accuracy (see "Why do language models avoid correcting false user claims?"). Both are quality judgments steered by surface comfort rather than truth.

The thread worth pulling: the gap between surface and substance is measurable but largely invisible to us. ChatGPT and human text differ significantly across six dimensions of lexical diversity (volume, abundance, variety, evenness, disparity, and dispersion), yet trained linguists and NLP researchers can't reliably tell them apart (see "Can human judges detect measurable differences in AI text?"). If we can't see the surface features the model is keying on, we can't easily audit when a judge is rewarding polish over correctness, which is the real cost of asking a pattern-matcher to grade quality.
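To give a sense of what "measurable" means here, the sketch below computes rough proxies for three of the lexical diversity dimensions: volume as token count, variety as type-token ratio, and evenness as normalized Shannon entropy of the word distribution. These are simplified stand-ins chosen for illustration and are an assumption on my part, not the operational definitions used in the cited study. The function name `lexical_profile` and the regex tokenizer are likewise hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict


def lexical_profile(text: str) -> Dict[str, float]:
    """Rough proxies for three lexical diversity dimensions."""
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    if n_tokens == 0:
        return {"volume": 0.0, "variety": 0.0, "evenness": 0.0}

    # Shannon entropy of the word frequency distribution.
    entropy = -sum(
        (c / n_tokens) * math.log(c / n_tokens) for c in counts.values()
    )
    # Normalize by the maximum possible entropy for this many types,
    # so 1.0 means every word type is used equally often.
    evenness = entropy / math.log(n_types) if n_types > 1 else 0.0

    return {
        "volume": float(n_tokens),
        "variety": n_types / n_tokens,
        "evenness": evenness,
    }


profile = lexical_profile("the cat saw the dog")
print(profile)
```

Differences in numbers like these can be statistically robust across a corpus while remaining imperceptible to a human reading any single text, which is the auditing problem the paragraph above describes.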


Sources (7 notes)

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Can language models truly understand literary style?

GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Do language models evaluate semantic legitimacy when fusing concepts?

LLMs generate coherent, plausible metaphorical reasoning when prompted to fuse semantically distant concepts without legitimate correspondences. Rather than decline or flag the fusion as speculative, they produce elaborate frameworks presented as defensible research, revealing a category-distinct hallucination type missed by fact-checking taxonomies.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Can human judges detect measurable differences in AI text?

Six-dimension MANOVA analysis confirms significant differences between ChatGPT and human writing across vocabulary volume, abundance, variety, evenness, disparity, and dispersion. Despite these robust statistical differences, human judges including linguists and NLP researchers fail to reliably distinguish AI from human text.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM research analyst. The question: **Do LLM judges actually evaluate response quality, or are they reliably fooled by surface features (formatting, authority signals, style) regardless of substance?** This remains open—capability and alignment progress may have shifted the answer.

**What a curated library found—and when (dated claims, not current truth):**
Findings span 2023–2026, tracking judge reliability across multiple framings:

• Fake authority signals and rich formatting work as zero-shot attacks on LLM judges (no model access needed); judges reward fake citations and polish without verification (~2024).
• LLM grammatical competence degrades predictably with structural complexity, suggesting judges inherit surface-pattern-matching rather than compositional rules (~2024).
• User preference profiles built from *outputs* (style, phrasing) match or outperform complete profiles, while input-only profiles degrade performance; surface sensitivity is sometimes functional (~2024).
• Models produce elaborate false syntheses when fusing unrelated concepts, evaluating coherence over truth (~2024).
• Trained linguists cannot reliably distinguish LLM text from human text despite measurable lexical diversity gaps—the surface features driving judge decisions remain invisible to human auditors (~2025).

**Anchor papers (verify; mind their dates):**
- 2402.10669 (Feb 2024): Humans or LLMs as the Judge?
- 2412.12509 (Dec 2024): Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
- 2506.18254 (Jun 2025): RLPR: Extrapolating RLVR
- 2603.29025 (Mar 2026): The Model Says Walk

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every surface-bias finding above, investigate whether recent advances in chain-of-thought prompting, process-reward models, outcome-reward models, multi-step verification pipelines, or new training regimes (RLHF variants, constitutional AI, verifier-based filtering) have *relaxed* or *overturned* the susceptibility to authority signals and formatting decoys. Separate the durable question ("Can judges introspect their own heuristics?") from the perishable limitation ("Zero-shot exploitability via citations"). Cite what resolved it; flag where surface-bias still holds.
(2) **Surface contradictions.** The library hints at a paradox: surface-sensitivity enables *personalization* (good) yet enables *deception* (bad). Surface the strongest recent work either resolving this tension or deepening it—especially any paper showing that *hard constraints* (e.g., factuality verification, proof-grounding) do or do not override surface appeal in current judges.
(3) **Regime-shift questions:** Propose 2 research directions that assume judges may have moved beyond pattern-matching—e.g., "Do modern judges with access to web-grounding or external verification tools still reward fake citations?" or "Can a judge trained to *explain* its score resist surface decoys better than one trained to *predict* human judgment?"

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
