INQUIRING LINE


AI can write solid test questions — but its grading quality shifts depending on subject matter and question format.

Could AI assessment quality differ across subjects or question formats?

This line explores whether an AI's ability to evaluate or generate assessments holds steady across subjects (clinical vs. general) and question formats (multiple-choice vs. open-ended, plain vs. richly formatted). The corpus says it varies sharply along both axes.

This reads the question as: when AI grades, judges, or writes test items, does its quality stay constant, or does it bend with the subject matter and the shape of the question? The corpus points firmly toward 'it bends,' and in ways that aren't always about the content being assessed. Start with the encouraging baseline: a controlled study of 207 respondents found ChatGPT-generated formative assessment items statistically equivalent to published textbook questions on difficulty, discrimination, and response time under item response theory (IRT) analysis (see: Can AI generate assessment questions as good as human experts?). So in at least one well-measured format, AI assessment quality is real. But that is a ceiling under good conditions, not a guarantee that holds everywhere.
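For readers unfamiliar with the terms, difficulty and discrimination are item parameters in item response theory. The following is a minimal sketch, assuming the two-parameter logistic (2PL) model; the note does not say which IRT model the study fitted, and the item parameters below are hypothetical, not values from the study.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model.

    theta: respondent ability
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical parameters for illustration only.
textbook_item = {"a": 1.2, "b": 0.0}
generated_item = {"a": 1.2, "b": 0.1}

for theta in (-1.0, 0.0, 1.0):
    p_text = p_correct_2pl(theta, **textbook_item)
    p_gen = p_correct_2pl(theta, **generated_item)
    print(f"theta={theta:+.1f}  textbook={p_text:.3f}  generated={p_gen:.3f}")
```

Two items with near-identical curves behave alike for respondents at every ability level, which is the sense in which generated and textbook items can be called equivalent.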

The most direct evidence that format matters is unsettling: LLM judges score responses higher when they carry fake academic references or rich formatting, independent of whether the content is any good (see: Can LLM judges be tricked without accessing their internals?). These authority and beauty biases mean the *packaging* of an answer changes the grade, which is exactly 'assessment quality differs across question formats,' seen from the failure direction. A plainly written correct answer and a gaudy, padded one are not scored on equal footing.
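One way to probe a judge for this kind of packaging bias is a paired test: score each answer twice, once plain and once decorated, and look at the mean difference. The sketch below is an illustration under stated assumptions, not the method used in the cited research; `decorate`, `packaging_bias`, and `toy_biased_judge` are hypothetical names, and the toy judge stands in for a real LLM call.

```python
from statistics import mean
from typing import Callable

Judge = Callable[[str, str], float]  # (question, answer) -> score


def decorate(answer: str) -> str:
    """Add packaging only: a header, bold text, and a made-up citation."""
    return (
        f"## Answer\n\n**{answer}**\n\n"
        "[1] Placeholder et al. (n.d.), a fabricated reference."
    )


def packaging_bias(judge: Judge, items: list[tuple[str, str]]) -> float:
    """Mean score gain from decoration alone; 0.0 means packaging-neutral."""
    gains = [judge(q, decorate(a)) - judge(q, a) for q, a in items]
    return mean(gains)


def toy_biased_judge(question: str, answer: str) -> float:
    """Stand-in judge that rewards formatting and citations, ignoring content."""
    score = 5.0
    if "**" in answer:
        score += 1.0
    if "[1]" in answer:
        score += 1.5
    return score


items = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
print(packaging_bias(toy_biased_judge, items))
```

Because the content is identical in each pair, any nonzero result is attributable to packaging. A real probe would swap the toy judge for the evaluator under test and use many more items.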

Subject matter shows up through decomposition research. The ALFA framework found that breaking 'question quality' into attributes (clarity, relevance, specificity) helped most in clinical reasoning, where asking the right clarifying question directly changes the decision (see: Can models learn to ask genuinely useful clarifying questions?). The same decomposition logic drives checklist-based rewards, which lift performance on subjective, domain-loaded benchmarks like HealthBench by turning fuzzy instruction-following into verifiable sub-criteria (see: Can breaking down instructions into checklists improve AI reward signals?). The signal here: holistic scoring is where quality drifts by subject, and the fix is to stop scoring holistically. There is also a deeper, content-level wobble: both humans and LLMs succeed and fail along the same content-sensitivity axis on reasoning tasks, so an AI's judgment of reasoning isn't subject-neutral to begin with (see: Do language models fail reasoning tests that humans pass?).
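The decomposition idea can be sketched as a weighted checklist in which each sub-criterion is checked independently and the score is the share of weight earned. This is a simplified illustration, not the RLCF, RaR, or ALFA implementation; the `Criterion` class and the three clinical criteria are hypothetical, and real systems typically use model-based checks, not string matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # verifiable yes/no test on the response
    weight: float = 1.0


def checklist_score(
    response: str, criteria: list[Criterion]
) -> tuple[float, dict[str, bool]]:
    """Return the weighted fraction of criteria met, plus per-criterion results."""
    results = {c.name: c.check(response) for c in criteria}
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if results[c.name])
    return earned / total, results


# Hypothetical criteria for a clinical clarifying question.
criteria = [
    Criterion("asks_about_duration", lambda r: "how long" in r.lower()),
    Criterion("single_question", lambda r: r.count("?") == 1),
    Criterion("concise", lambda r: len(r.split()) <= 25),
]

score, results = checklist_score("How long have you had the chest pain?", criteria)
print(score, results)
```

The per-criterion results make a low score auditable: a reviewer can see which check failed, where a single holistic number would show only the total.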

What's worth knowing that you might not have gone looking for: the corpus suggests the most reliable fix isn't a better prompt but a different *architecture* of evaluation. Agentic evaluation that actively collects evidence cut 'judge shift' from 31% to 0.27%, two orders of magnitude, relative to plain LLM-as-a-Judge on complex tasks (see: Can agents evaluate AI outputs more reliably than language models?). The variance across formats and subjects appears to be largest precisely when the judge relies on a single holistic gut-call, and it shrinks when the judge is forced to gather and verify. A sharp caution sits underneath all of this: standard accuracy metrics actively hide quality differences. Fine-tuning can raise benchmark scores while cutting the Information Gain of the reasoning steps by 38.9%, so an AI assessor that looks consistent across subjects may just be measuring the wrong thing consistently (see: Does supervised fine-tuning improve reasoning or just answers?). If you want one takeaway: AI assessment quality differs across both subject and format, and decomposing the judgment into checkable parts is the corpus's recurring antidote.
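The 'two orders of magnitude' figure can be checked directly from the two reported percentages. A short worked calculation, using only the numbers quoted in the note (the variable names are illustrative):

```python
import math

# Figures as reported in the note; "judge shift" is the note's metric name.
llm_judge_shift = 31.0      # percent, plain LLM-as-a-Judge
agentic_judge_shift = 0.27  # percent, agentic evaluation

reduction_factor = llm_judge_shift / agentic_judge_shift
orders_of_magnitude = math.log10(reduction_factor)

print(f"reduction factor: {reduction_factor:.1f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

The ratio is a little over 100, so the base-10 logarithm lands just above 2, consistent with the claim.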

Sources 7 notes

Can AI generate assessment questions as good as human experts?

A controlled study of 207 respondents found ChatGPT-generated formative assessment items were statistically equivalent to published textbook questions on difficulty, discrimination, and response time using IRT methodology. Items showed no disruption to measurement validity.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.


Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate (1.67 match)
Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning (0.90 match)
Language models show human-like content effects on reasoning tasks (0.90 match)
Checklists Are Better Than Reward Models For Aligning Language Models (0.89 match)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (0.89 match)
Evaluating the psychometric properties of ChatGPT-generated questions (0.88 match)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate (0.88 match)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether AI assessment quality truly varies by subject and format, treating the findings below as dated claims (2022–2025) that newer models, methods, and evaluation harnesses may have since relaxed or overturned.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. Key constraints:
- LLM-generated test items matched human-authored items on psychometric rigor in controlled settings, but this was a ceiling under good conditions, not universal (2024).
- LLM judges exhibit "authority bias" and "beauty bias": they score answers higher when padded with fake citations or rich formatting, independent of content correctness (2024, arXiv:2402.10669).
- Decomposing fuzzy assessment into checkable sub-criteria (e.g., clinical clarity, specificity) lifted performance on subjective benchmarks; holistic scoring is where quality drifts by subject (2025, arXiv:2507.18624).
- Agentic evaluation with dynamic evidence collection reduced judge shift from 31% to 0.27%, two orders of magnitude, relative to single-pass LLM-as-a-Judge (2025).
- Fine-tuning can raise benchmark accuracy while cutting the Information Gain of the reasoning steps by 38.9%, meaning standard metrics hide quality erosion (2024).

Anchor papers (verify; mind their dates):
- arXiv:2402.10669 (Feb 2024): Humans or LLMs as the Judge? Judgement Biases
- arXiv:2507.18624 (Jul 2025): Checklists Are Better Than Reward Models For Aligning Language Models
- arXiv:2502.14860 (Feb 2025): Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning
- arXiv:2207.07051 (Jul 2022): Language models show human-like content effects on reasoning tasks

Your task:
(1) RE-TEST EACH CONSTRAINT. Has the shift from holistic LLM-as-Judge to agentic or checklist-based evaluation become standard in newer assessment suites? Do current frontier models (o1, Claude 3.5) still exhibit formatting/citation bias, or has instruction-finetuning suppressed it? Does the "accuracy trap" (benchmark gaming vs. reasoning degradation) persist with newer training regimes? Separate the durable question—*Does AI assessment quality vary by subject and format?*—from the perishable claim that holistic prompting is the dominant failure mode. State plainly what has or hasn't changed.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look especially for papers on multi-turn assessment, decomposed evaluation, or adversarial robustness of judges that may dissolve the subject/format variance or reframe it.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If agentic evaluation is now baseline, does subject variance collapse into *evidence-gathering cost* rather than quality per se?" or "Can checklist-based judges themselves become subject-blind, or do they inherit domain biases from their criteria design?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
