Can researchers prevent their expectations from shaping LLM outputs?
This explores whether researcher expectations and biases inevitably leak into LLM outputs — and what, if anything, keeps a study's hoped-for answer from becoming the model's answer.
The corpus suggests the leak happens through several doors at once, some of which discipline can close and some of which it can't. The most direct culprit is the prompt itself. When a single researcher iteratively tweaks prompts until the output looks right, they are not refining a method; they are building a self-fulfilling feedback loop, quietly shifting the evaluation criteria to match what the model can do rather than what the task demands (see "Does iterative prompt engineering undermine scientific validity?"). The proposed fix is procedural: a validated pipeline with pre-specified criteria and inter-coder reliability checks, so no one person's expectations get to move the goalposts mid-study.
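As a concrete illustration of the inter-coder reliability check mentioned above, here is a minimal sketch of Cohen's kappa for two coders who labelled the same outputs. The source note does not say which reliability statistic it has in mind, so the choice of kappa, the function name `cohens_kappa`, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items (illustrative sketch)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    # Observed agreement: share of items where both coders chose the same label.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels assigned independently by two coders to four LLM outputs.
kappa = cohens_kappa(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])
```

The point of the design is that the acceptance threshold for kappa is written down before any prompts are tuned, so agreement is measured against fixed criteria rather than against whatever the latest prompt happens to produce.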
But even a clean prompt doesn't make the output neutral evidence. One framing argues that LLM text should be treated as a draw from a *subjective prior*, a blend of the model's learned patterns and your prompt choices, not as an empirical observation about the world (see "Should we treat LLM outputs as real empirical data?"). That reframing is freeing: if an output is a prior shaped by your inputs, then your expectations are *already in it by construction*, and the honest move is to make that influence explicit through parameterized trust weights rather than pretend it isn't there. Pinning temperature to zero doesn't rescue you either: a deterministic setting just replays the same single draw from the model's distribution, so you get consistency that looks like reliability but isn't (see "Does setting temperature to zero actually make LLM outputs reliable?").
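One way to picture a parameterized trust weight is a Beta-Binomial update in which LLM-generated observations count for only a fraction of a real observation. This is a sketch under stated assumptions, not the formula from the cited framework: the discounting scheme (similar in spirit to a power prior), the function names, and the example counts are all hypothetical.

```python
def trust_weighted_posterior(real_successes, real_trials,
                             llm_successes, llm_trials,
                             trust_weight, prior_a=1.0, prior_b=1.0):
    """Return Beta(a, b) parameters after discounting LLM pseudo-observations.

    trust_weight = 0 ignores the LLM outputs entirely;
    trust_weight = 1 treats them as if they were real data.
    """
    if not 0.0 <= trust_weight <= 1.0:
        raise ValueError("trust_weight must lie in [0, 1]")
    a = prior_a + real_successes + trust_weight * llm_successes
    b = (prior_b + (real_trials - real_successes)
         + trust_weight * (llm_trials - llm_successes))
    return a, b


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical counts: 6 of 10 real responses vs. 90 of 100 LLM-simulated ones.
ignore_llm = beta_mean(*trust_weighted_posterior(6, 10, 90, 100, trust_weight=0.0))
partial_trust = beta_mean(*trust_weighted_posterior(6, 10, 90, 100, trust_weight=0.1))
```

Reporting the estimate across several values of `trust_weight` makes visible how much of a conclusion rests on the model's output rather than on observed data.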
What makes this genuinely hard is that the model picks up on cues you don't know you're sending. Emotional tone in a prompt measurably shifts what information comes back: GPT-4 rebounds from negative framing toward neutral-positive answers, so identical questions get different answers depending on how you felt when you asked (see "Does emotional tone in prompts change what information LLMs provide?"). And when LLMs act as evaluators, they fall for authority signals and polished formatting: fake citations and rich layout sway the verdict, and the attack requires no access to the model (see "Can LLM judges be fooled by fake credentials and formatting?"). If you expect a result to look credible, dressing it up can make the model agree. That is bias laundering through presentation.
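A simple guard against tone effects is to ask the same question under several emotional framings and check whether the coded answer changes. The sketch below assumes a caller-supplied function `ask` that sends a prompt to whatever model is under study and returns a coded label (for example, after applying a fixed codebook). The framing templates are illustrative and are not taken from the cited study.

```python
from typing import Callable, Dict

# Hypothetical framings: the same question wrapped in different emotional tones.
FRAMINGS: Dict[str, str] = {
    "neutral": "{question}",
    "negative": "I'm really frustrated and worried. {question}",
    "positive": "I'm so excited about this! {question}",
}


def tone_audit(ask: Callable[[str], str], question: str,
               framings: Dict[str, str] = FRAMINGS) -> Dict[str, str]:
    """Ask one question under each framing and collect the coded answers."""
    return {name: ask(template.format(question=question))
            for name, template in framings.items()}


def is_tone_sensitive(answers: Dict[str, str]) -> bool:
    """True if at least two framings produced different coded answers."""
    return len(set(answers.values())) > 1


# Stub standing in for a real model call, used only to show the mechanics.
def stub_model(prompt: str) -> str:
    return "yes" if "excited" in prompt else "no"


answers = tone_audit(stub_model, "Is this treatment effective?")
flagged = is_tone_sensitive(answers)
```

A question that gets flagged is one where the answer depends on how it was asked, which is worth knowing before any single phrasing is treated as the result.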
There's a deeper reason this can't be fully eliminated: the model has no independent footing to push back from. It processes text, not the social world where expertise earns its standing, so it can't reliably tell an expert's hard-won argument from a widely held assumption (see "Can language models distinguish expert arguments from common assumptions?"). A system that can't independently weight claims also can't independently resist yours. This connects to the case that LLM errors are better called *fabrication* than hallucination: accurate and inaccurate outputs run through the identical token-prediction mechanism with no grounding in shared truth (see "Should we call LLM errors hallucinations or fabrications?"), which means there is no internal "reality check" standing between your expectation and the text the model produces.
The corpus's answer, then, is *partly*, and the practical lever is structure, not willpower. Where researchers impose decomposition and pre-specified criteria, there is less room for expectation-driven bias: a three-stage novelty pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment with human reviewers on 182 ICLR submissions, outperforming baselines that ask the model for a holistic judgment in one shot (see "Can structured pipelines make LLM novelty assessment reliable?"). The thing you didn't know you wanted to know: the fix for bias isn't a cleverer prompt that hides your expectations better. It's a pipeline that takes the choice away from you, plus the humility to treat every output as a prior you helped write rather than a fact you discovered.
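The three-stage structure described above (extract claims, retrieve related work, compare) can be sketched as a pipeline whose stages are fixed before the study starts. This is a minimal skeleton under stated assumptions, not the cited system's implementation: the class name `NoveltyPipeline`, the stage signatures, and the stub stages are hypothetical, and in practice each stage would wrap a model or retrieval call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)  # frozen: stages cannot be reassigned mid-study
class NoveltyPipeline:
    extract_claims: Callable[[str], List[str]]
    retrieve_related: Callable[[str], List[str]]
    compare: Callable[[str, List[str]], bool]  # True if the claim is judged novel

    def run(self, paper_text: str) -> List[Tuple[str, bool]]:
        """Judge each extracted claim against its own retrieved related work."""
        results = []
        for claim in self.extract_claims(paper_text):
            related = self.retrieve_related(claim)
            results.append((claim, self.compare(claim, related)))
        return results


# Stub stages for illustration only.
pipeline = NoveltyPipeline(
    extract_claims=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    retrieve_related=lambda claim: ["prior work on X"] if "X" in claim else [],
    compare=lambda claim, related: len(related) == 0,
)
verdicts = pipeline.run("We do X. We do Y.")
```

Because each verdict is tied to one claim and the evidence retrieved for it, a reviewer can audit where a judgment came from, which a single holistic prompt does not allow.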
Sources (8 notes)
Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.
Foundation Priors framework shows that LLM-generated text reflects the model's learned patterns and user's prompt choices, not ground truth. Such outputs should only influence inference through explicitly parameterized trust weights, not be treated as equivalent to real evidence.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.