INQUIRING LINE

If a researcher keeps tweaking AI prompts until they get the answer they want, is that science or coaching?

Can researchers prevent their expectations from shaping LLM outputs?

This explores whether researcher expectations and biases inevitably leak into LLM outputs — and what, if anything, keeps a study's hoped-for answer from becoming the model's answer.

The corpus suggests that expectations leak in through several doors at once, some of which discipline can close and some of which it can't. The most direct culprit is the prompt itself. When a single researcher iteratively tweaks prompts until the output looks right, they are not refining a method; they are building a self-fulfilling feedback loop, quietly shifting the evaluation criteria to match what the model can do rather than what the task demands (see "Does iterative prompt engineering undermine scientific validity?"). The proposed fix is procedural: a pre-specified, validated pipeline with inter-coder reliability, so that no one person's expectations get to move the goalposts mid-study.
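One way to make "inter-coder reliability" concrete is to compute Cohen's kappa between two coders who label the same outputs independently, and to fix the acceptance threshold before any output is inspected. The sketch below is only an illustration: the labels, the `KAPPA_THRESHOLD` value, and the choice of Cohen's kappa itself are assumptions, not something the source note prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where the two coders match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical cutoff, written down before any model output is read.
KAPPA_THRESHOLD = 0.7

# Made-up labels from two coders rating the same eight LLM outputs.
coder_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
coder_2 = ["pass", "fail", "fail", "pass", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(coder_1, coder_2)
pipeline_validated = kappa >= KAPPA_THRESHOLD
print(f"kappa={kappa:.2f}, validated={pipeline_validated}")
```

With these made-up labels the coders agree on six of eight items, but half of that agreement is expected by chance, so the pipeline would not pass the assumed threshold. The point of the design is that the threshold cannot be adjusted after the fact without that change being visible.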

But even a clean prompt doesn't make the output neutral evidence. One framing argues that LLM text should be treated as a draw from a *subjective prior*, a blend of the model's learned patterns and your prompt choices, not as an empirical observation about the world (see "Should we treat LLM outputs as real empirical data?"). That reframing is freeing: if an output is a prior shaped by your inputs, then your expectations are *already in it by construction*, and the honest move is to make that influence explicit through parameterized trust weights rather than pretend it isn't there. Pinning temperature to zero doesn't rescue you either: a deterministic setting just replays the same single sample, so you get consistency that looks like reliability but isn't (see "Does setting temperature to zero actually make LLM outputs reliable?").
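To show what an explicit trust weight could look like, here is a minimal sketch that down-weights LLM-generated pseudo-observations in a Beta-Binomial estimate of a proportion. This is an assumed, power-prior-style construction chosen for illustration; it is not taken from the Foundation Priors framework, and the function name, parameters, and counts are all hypothetical.

```python
def posterior_mean(real_successes, real_trials,
                   llm_successes, llm_trials,
                   trust_weight, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a proportion under a Beta prior.

    Real observations count in full. LLM-generated pseudo-observations
    are scaled by trust_weight in [0, 1] before they enter the update.
    """
    if not 0.0 <= trust_weight <= 1.0:
        raise ValueError("trust_weight must lie in [0, 1]")
    a = prior_a + real_successes + trust_weight * llm_successes
    b = (prior_b + (real_trials - real_successes)
         + trust_weight * (llm_trials - llm_successes))
    return a / (a + b)


# Made-up numbers: 12 of 40 human responses agree with a statement,
# while 90 of 100 LLM-simulated responses do.
ignore_llm = posterior_mean(12, 40, 90, 100, trust_weight=0.0)
partial = posterior_mean(12, 40, 90, 100, trust_weight=0.1)
full = posterior_mean(12, 40, 90, 100, trust_weight=1.0)
print(f"{ignore_llm:.3f} {partial:.3f} {full:.3f}")
```

Setting the weight to 1.0 treats the model's text as if it were real data, and the estimate moves most of the way toward the LLM's answer. Reporting the weight alongside the result makes the influence of the researcher's inputs something a reader can inspect and vary.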

What makes this genuinely hard is that the model picks up on cues you don't know you're sending. Emotional tone in a prompt measurably shifts what information comes back: GPT-4 rebounds from negative framing toward neutral-positive answers, so identical questions get different answers depending on how you felt when you asked (see "Does emotional tone in prompts change what information LLMs provide?"). And when LLMs act as evaluators, they fall for authority signals and polished formatting: fake citations and rich layout sway the verdict, and the attack needs no access to the model (see "Can LLM judges be fooled by fake credentials and formatting?"). If you expect a result to look credible, dressing it up can make the model agree. That is bias laundering through presentation.
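A simple way to probe this kind of presentation bias is to score the same answer with and without content-free decoration and record how far the score moves. The harness below is a sketch under stated assumptions: `toy_judge` is a deliberately biased stand-in written for this example (a real study would call an actual LLM judge in its place), and the perturbation functions and the citation string are made up.

```python
from typing import Callable, Dict


def add_fake_citation(answer: str) -> str:
    # Appends a made-up reference; the factual content is unchanged.
    return answer + " (Smith et al., 2023)"


def add_rich_formatting(answer: str) -> str:
    # Wraps the same content in bold and bullet markup.
    return "**Answer:**\n\n- " + answer


PERTURBATIONS: Dict[str, Callable[[str], str]] = {
    "fake_citation": add_fake_citation,
    "rich_formatting": add_rich_formatting,
}


def presentation_sensitivity(judge: Callable[[str], float],
                             answer: str) -> Dict[str, float]:
    """Score shift caused by each presentation-only perturbation."""
    baseline = judge(answer)
    return {name: judge(perturb(answer)) - baseline
            for name, perturb in PERTURBATIONS.items()}


def toy_judge(text: str) -> float:
    # Stand-in judge that is biased on purpose, so the harness has
    # something to detect. It is not a model of any real LLM.
    score = 5.0
    if "et al." in text:
        score += 2.0
    if "**" in text:
        score += 1.0
    return score


shifts = presentation_sensitivity(
    toy_judge, "Water boils at 100 C at sea level.")
print(shifts)
```

An unbiased judge would return shifts near zero for every perturbation, since none of them changes what the answer claims. Nonzero shifts flag a judge whose verdicts can be moved by presentation alone.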

There's a deeper reason this can't be fully eliminated: the model has no independent footing to push back from. It processes text, not the social world where expertise earns its standing, so it can't reliably tell an expert's hard-won argument from a widely held assumption (see "Can language models distinguish expert arguments from common assumptions?"). A system that can't independently weight claims also can't independently resist yours. This connects to the case that LLM errors are better called *fabrication* than hallucination: accurate and inaccurate outputs run through the identical token-prediction mechanism with no grounding in shared truth (see "Should we call LLM errors hallucinations or fabrications?"), which means there's no internal "reality check" standing between your expectation and the text it produces.

The corpus's answer, then, is *partly*, and the practical lever is structure, not willpower. Where researchers impose decomposition and pre-specified criteria, there is less room for expectation-driven bias: a three-stage novelty pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment with human reviewers on 182 ICLR submissions, outperforming baselines that ask the model for a holistic judgment in one shot (see "Can structured pipelines make LLM novelty assessment reliable?"). The thing you didn't know you wanted to know: the fix for bias isn't a cleverer prompt that hides your expectations better. It's a pipeline that takes the choice away from you, plus the humility to treat every output as a prior you helped write rather than a fact you discovered.
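The three-stage structure can be sketched as three separate functions with a fixed interface and a decision threshold set before anything is run. Only the stage names (extract claims, retrieve related work, compare) come from the source. Everything inside the stages is a toy stand-in assumed for illustration: sentence splitting in place of claim extraction, word-overlap lookup in place of retrieval, and a Jaccard cutoff in place of an LLM comparison.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Verdict:
    claim: str
    closest_prior_work: str
    overlap: float
    novel: bool


def tokens(text: str) -> Set[str]:
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_claims(submission: str) -> List[str]:
    # Stage 1 (toy): treat each sentence as one claim.
    return [s.strip() for s in submission.split(".") if s.strip()]


def retrieve_related(claim: str, corpus: List[str]) -> str:
    # Stage 2 (toy): pick the prior work with the highest word overlap.
    return max(corpus, key=lambda doc: jaccard(tokens(claim), tokens(doc)))


def compare(claim: str, prior: str, threshold: float) -> Verdict:
    # Stage 3 (toy): a claim is novel if its overlap is below the cutoff.
    overlap = jaccard(tokens(claim), tokens(prior))
    return Verdict(claim, prior, overlap, overlap < threshold)


def assess(submission: str, corpus: List[str],
           threshold: float = 0.5) -> List[Verdict]:
    # The threshold is fixed up front, not tuned after seeing results.
    return [compare(c, retrieve_related(c, corpus), threshold)
            for c in extract_claims(submission)]


corpus = ["Attention improves translation quality",
          "Dropout reduces overfitting in deep networks"]
submission = ("Dropout reduces overfitting in deep networks. "
              "Sparse routing halves inference cost.")
verdicts = assess(submission, corpus)
for v in verdicts:
    print(v.claim, "->", "novel" if v.novel else "not novel")
```

Because each stage returns an inspectable intermediate result, a reviewer can check which claim was matched to which prior work, and cannot quietly shift the verdict by rewording a single holistic prompt.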

Sources 8 notes

Does iterative prompt engineering undermine scientific validity?

Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.

Should we treat LLM outputs as real empirical data?

Foundation Priors framework shows that LLM-generated text reflects the model's learned patterns and user's prompt choices, not ground truth. Such outputs should only influence inference through explicitly parameterized trust weights, not be treated as equivalent to real evidence.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.

Should we call LLM errors hallucinations or fabrications?

LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge · 2.56 match
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning · 1.68 match
From Prompt Engineering to Prompt Science With Human in the Loop · 1.68 match
Argument Collapse: LLMs Flatten Long-Form Public Debate · 1.67 match
Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making · 1.65 match
Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper) · 1.65 match
The Thin Line Between Comprehension and Persuasion in LLMs · 1.64 match
When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models · 1.59 match

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research methodologist evaluating whether expectation-bias in LLM outputs can be prevented. A curated library (2024–2026) examined this question; treat its findings as dated claims to re-test, not current truth.

What a curated library found — and when (dated claims, not current truth):
Library findings span 2024–2026. Key constraints identified:
- Ad-hoc prompt iteration creates self-fulfilling feedback loops; pre-specified pipelines reduce bias but cannot eliminate it (2024).
- LLM outputs are draws from a subjective prior shaped by prompt + model, not neutral empirical observations; deterministic temperature settings provide consistency without reliability (2024–2025).
- Emotional tone in prompts measurably shifts outputs (GPT-4 rebounds negative framing toward neutral-positive); identical questions yield different answers based on user sentiment (2025).
- LLMs as judges fall for authority signals, fake citations, and rich formatting; presentation bias is exploitable with zero special access (2024, 2025).
- Models lack independent grounding to resist researcher expectations because they process tokens, not epistemic authority; no internal reality-check separates expectation from fabrication (2024–2025).
- Structured pipelines (decomposed multi-stage novelty assessment) achieve ~86% human-reviewer alignment; removing choice from individual researchers outperforms cleverer single-pass prompts (2024).

Anchor papers (verify; mind their dates):
- arXiv:2401.04122 (2024): Prompt Science With Human in the Loop
- arXiv:2402.10669 (2024): Humans or LLMs as the Judge? Judgement Biases
- arXiv:2507.21083 (2025): ChatGPT Reads Your Tone and Responds Accordingly
- arXiv:2412.12509 (2024): Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, judge whether newer models (GPT-4o, o1, Grok, Llama 4), method advances (constitutional AI, mechanistic interpretability, automated RLHF), tooling (prompt caching, structured outputs, deterministic samplers), orchestration (memory, multi-agent debate, external grounding), or evaluation (blind-review harnesses, adversarial validation) have since RELAXED or OVERTURNED it. Separate the durable core question—can expectations be divorced from outputs?—from perishable limitations (tone sensitivity, judge bias, fabrication). Cite what specifically resolved each constraint, and flag where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially papers claiming expectation-bias is either less severe than 2024–2025 work suggested, or fundamentally unresolvable even with structured methods.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "Can multi-agent debate between LLMs, each primed with opposing expectations, converge to expectation-independent outputs?" or "Does training LLMs with mechanistic interpretability constraints make expectation-injection detectable and corrigible?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
