Does setting temperature to zero actually make LLM outputs reliable?
Explores whether deterministic LLM settings that produce consistent outputs also guarantee reliable judgments, and how to measure true reliability beyond surface consistency.
"Can You Trust LLM Judgments?" (2024) introduces a rigorous framework for evaluating LLM-as-a-Judge reliability using McDonald's omega, revealing that the common practice of using fixed seeds and deterministic settings provides false confidence.
The core argument: even with deterministic settings, a single LLM output is one sample from the model's probability distribution. Setting temperature to zero and fixing the seed produces "fixed randomness": the same output every time, but that output may still be a misleading draw from the distribution. Consistent replication therefore does not guarantee reliability. Calibration does not rescue the situation either. A perfectly calibrated LLM that reports 90% confidence should be correct about 9 times out of 10 across many such judgments, yet it can still be unreliable if its output distribution has high variance, because any single judgment may land far from the model's typical answer.
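A toy illustration of "fixed randomness" may help. The sketch below is not an LLM call: `toy_judge`, `SCORES`, and `WEIGHTS` are hypothetical stand-ins for a judge whose score on one input follows a fixed probability distribution, and a seeded sampler stands in for deterministic settings.

```python
import random

# Hypothetical score distribution of a judge on ONE fixed input (assumed for illustration).
SCORES = [1, 2, 3, 4, 5]
WEIGHTS = [0.10, 0.20, 0.30, 0.25, 0.15]


def toy_judge(seed: int) -> int:
    """Draw one score from the assumed distribution using a seeded sampler."""
    rng = random.Random(seed)
    return rng.choices(SCORES, weights=WEIGHTS, k=1)[0]


# Same seed: the "judgment" replicates perfectly every time.
fixed_seed_runs = [toy_judge(seed=42) for _ in range(5)]

# Different seeds: the spread that the fixed seed was hiding becomes visible.
varied_seed_runs = [toy_judge(seed=s) for s in range(100)]

print("fixed seed :", fixed_seed_runs)
print("varied seed:", sorted(set(varied_seed_runs)))
```

Perfect agreement in the first list says nothing about how wide the second list is, which is the gap between replication and reliability.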
The framework: prompt the judge LLM 100 times on the same input, varying only the replication while holding all other factors constant. Apply McDonald's omega, a coefficient of internal consistency, across these replications. This reveals whether the model's judgments are stable properties of the input or artifacts of the sampling process.
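The following is a rough sketch of the omega computation, not the paper's implementation. It assumes replications play the role of test items and judged inputs play the role of respondents, and it approximates one-factor loadings with the first principal component of the covariance matrix; a real analysis would fit a factor model with a dedicated package. The simulated judges, the noise levels, and all function names are hypothetical.

```python
import random
import statistics


def covariance_matrix(scores):
    """scores[i][r] is the score for input i on replication r; returns the k x k sample covariance."""
    n, k = len(scores), len(scores[0])
    means = [statistics.fmean(row[r] for row in scores) for r in range(k)]
    cov = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            c = sum((row[a] - means[a]) * (row[b] - means[b]) for row in scores) / (n - 1)
            cov[a][b] = cov[b][a] = c
    return cov


def leading_eigenpair(matrix, iterations=500):
    """Power iteration for the largest eigenvalue and its eigenvector."""
    k = len(matrix)
    v = [1.0 / k ** 0.5] * k
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k))
    return eigenvalue, v


def approx_omega(scores):
    """omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses).

    Loadings are approximated by the first principal component (an assumption).
    """
    cov = covariance_matrix(scores)
    eigenvalue, v = leading_eigenpair(cov)
    loadings = [max(eigenvalue, 0.0) ** 0.5 * x for x in v]
    uniquenesses = [max(cov[i][i] - loadings[i] ** 2, 0.0) for i in range(len(cov))]
    common = sum(loadings) ** 2
    total = common + sum(uniquenesses)
    return common / total if total > 0 else 0.0


def simulate_judge(true_quality, noise_sd, replications, rng):
    """Hypothetical judge: score = input quality + sampling noise."""
    return [[q + rng.gauss(0.0, noise_sd) for _ in range(replications)] for q in true_quality]


rng = random.Random(0)
true_quality = [rng.uniform(1.0, 10.0) for _ in range(200)]
omega_stable = approx_omega(simulate_judge(true_quality, 0.3, 10, rng))
omega_noisy = approx_omega(simulate_judge(true_quality, 6.0, 10, rng))
print(f"low sampling noise : omega ~ {omega_stable:.3f}")
print(f"high sampling noise: omega ~ {omega_noisy:.3f}")
```

The example uses 10 replications rather than 100 only to keep the pure-Python run short. Omega rises when scores track the input and falls when sampling noise dominates.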
The distinction between reliability, confidence, and calibration is critical:
- Calibration: alignment between stated confidence and actual correctness
- Confidence: the model's self-assessed certainty
- Reliability: consistency of judgments across multiple draws
These three are intertwined but distinct. A model can be well calibrated (its stated confidence matches how often it is right) yet unreliable (it gives different answers on different draws). Conversely, a model can be reliable (it always gives the same answer) yet poorly calibrated (that consistent answer is wrong despite high stated confidence).
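The two contrasting cases can be made concrete with hand-built data. This is a minimal sketch under loud assumptions: the answers and confidences are invented, the ten draws are treated as the whole evaluation set, and `calibration_gap` is a crude single-bucket measure rather than a standard calibration metric.

```python
from collections import Counter
from statistics import fmean


def reliability(answers):
    """Share of draws that agree with the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def calibration_gap(answers, confidences, truth):
    """Absolute gap between mean stated confidence and observed accuracy."""
    accuracy = fmean(1.0 if a == truth else 0.0 for a in answers)
    return abs(fmean(confidences) - accuracy)


truth = "pass"

# Judge A (hypothetical): says 60% confident and is right 6 times in 10, but flips between draws.
answers_a = ["pass"] * 6 + ["fail"] * 4
gap_a = calibration_gap(answers_a, [0.60] * 10, truth)
rel_a = reliability(answers_a)

# Judge B (hypothetical): always gives the same answer at 95% confidence, and it is always wrong.
answers_b = ["fail"] * 10
gap_b = calibration_gap(answers_b, [0.95] * 10, truth)
rel_b = reliability(answers_b)

print(f"Judge A: calibration gap {gap_a:.2f}, reliability {rel_a:.2f}")
print(f"Judge B: calibration gap {gap_b:.2f}, reliability {rel_b:.2f}")
```

Judge A is calibrated but unreliable, and Judge B is reliable but miscalibrated, so neither measure can substitute for the other.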
This connects to the note "Does model confidence predict robustness to prompt changes?": ProSA measures sensitivity to prompt variation, while this framework measures sensitivity to sampling variation. Both show that single evaluations are insufficient. The practical implication: any LLM-as-a-Judge deployment that relies on single-shot evaluation with deterministic settings offers the illusion of precision without evidence of reliability.
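One way to act on this, sketched under assumptions: wrap the judge so that every verdict is reported with its spread across replications instead of as a single number. The `judge` callable signature, `JudgmentSummary`, and `stub_judge` are hypothetical; a real deployment would replace the stub with an actual model call.

```python
import statistics
from typing import Callable, NamedTuple


class JudgmentSummary(NamedTuple):
    mean: float
    stdev: float
    n: int


def replicated_judgment(judge: Callable[[str, int], float], prompt: str, n: int = 20) -> JudgmentSummary:
    """Call the judge n times on the same prompt and summarize the scores."""
    if n < 2:
        raise ValueError("need at least two replications to estimate spread")
    scores = [judge(prompt, replication) for replication in range(n)]
    return JudgmentSummary(statistics.fmean(scores), statistics.stdev(scores), n)


def stub_judge(prompt: str, replication: int) -> float:
    """Hypothetical stand-in for a model call: alternates between two scores."""
    return 7.0 if replication % 2 == 0 else 3.0


summary = replicated_judgment(stub_judge, "Rate this answer from 1 to 10.", n=20)
print(summary)
```

A single call to this stub would report either 7 or 3 with no hint of disagreement; the summary exposes a mean near the middle with a large standard deviation.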
Inquiring lines that use this note as a source (126)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What makes LLM outputs fabrication rather than hallucination or confabulation?
- Can LLMs evaluate their own observations without external feedback?
- Why do one-shot transparency studies miss the temporal reversal entirely?
- What makes accountability and validity-orientation non-behavioral properties?
- Why does aggregate accuracy fail as a metric for rare harmful cases?
- What distinguishes minimal-pair asymmetry from standard accuracy evaluation?
- What should we call errors in LLM outputs when hallucination does not apply?
- Can systems lacking inner states express genuine truthfulness claims?
- What would it mean to assign explicit trust weights to synthetic data?
- Can LLM judges reliably estimate when they lack sufficient persona information?
- How should ground truth labels be assigned to simulated user sessions?
- How does unidimensionality in assessments affect measurement validity?
- Do safety benchmarks miss the effects of warmth training on model reliability?
- What calibration corrections can reduce LLM judge bias in automated evaluation pipelines?
- How much does ROUGE metric choice inflate hallucination detection claims?
- Does inevitable LLM hallucination make detection metric validity critical?
- Why do users systematically overrely on confident LLM outputs across languages?
- What structural features force users to evaluate the epistemic status of outputs?
- Can researchers prevent their expectations from shaping LLM outputs?
- What makes inter-coder reliability testing essential for prompt validation?
- How does step-level confidence filtering compare to global confidence averaging?
- How should product specifications measure alignment without naming the dimension?
- What does the 20-questions test reveal about LLM character consistency?
- Can distributional views explain when an LLM appears to change its mind?
- Can evaluation criteria be reliably encoded in labeled data without ground truth standards?
- Can precision and recall metrics work without a ground truth?
- Does majority voting reliably signal correctness without risking reward hacking?
- Can log-likelihood loss combined with binary rewards achieve calibration?
- What makes the Brier score mathematically better than log-likelihood here?
- Why do standard accuracy metrics ignore set-level consumption constraints?
- How do we assign confidence and polarity scores to belief edges?
- Can utility control modify LLM values more effectively than output filtering?
- Does layer-wise prediction stabilization provide a stronger trace quality signal than confidence alone?
- How does activation consistency training differ from output-level consistency?
- Why does model confidence correlate with robustness to prompt variations?
- Why does analytical depth demand trigger fabrication over transparent uncertainty?
- How does the three-component definition apply to test-time scaling laws?
- What does McDonald's omega reveal about LLM judgment consistency?
- Can an LLM be well calibrated but still unreliable on single evaluations?
- How do calibration and reliability differ in LLM judge evaluations?
- How do training data cutoffs produce false claims that stay consistent?
- How reliable is the top-2 confidence gap as a stopping signal across tasks?
- What property must remain constant to individuate an LLM across infrastructure changes?
- Can scaling predictions become reliable if improvements are continuous not sudden?
- How does disembedding from social context collapse reliability despite factual accuracy?
- How do surface statistical regularities enable correct outputs while degrading robustness?
- Why does low temperature sampling extract consensus from diverse training data?
- What skills do users need to work effectively with stochastic outputs?
- What makes output convergence across models inevitable given input-side homogenization?
- Why do models fail under distribution shift if accuracy metrics stay high?
- Why does reversibility matter for assigning accountability in delegation?
- How should monitoring intensity change based on task criticality?
- Can fact-checking systems use LLMs reliably if models abandon correct positions under pressure?
- Does exposure to more domain-specific examples reduce LLM overconfidence?
- What distinguishes actual social disagreement from distributional uncertainty in LLM outputs?
- Can uncertainty estimates based on model self-assessment reliably signal errors?
- Why do true and false LLM outputs use the same mechanism?
- Why do improvements in accuracy come at the cost of calibration?
- Can measuring semantic entropy help us detect unreliable generations?
- How can we verify outputs from systems that generate without grounding?
- Which use cases can tolerate unverified LLM outputs without external verification?
- What makes certain bond distributions more learnable than others?
- Does model confidence actually correlate with robustness against prompt variations?
- Why do models maintain accurate beliefs but generate false claims?
- What makes some model capabilities reliable while others remain brittle?
- How does self-consistency compare to confidence as a proxy reward signal?
- What makes the 45 percent accuracy saturation threshold universal?
- Why do different LLMs converge on nearly identical outputs?
- How much does omniscient evaluation overstate real-world simulation fidelity?
- Why is the Judging preference constant while other traits vary slightly?
- How does Goodhart's Law apply when safety measures become optimization targets?
- Can LLMs recover true joint distributions from marginal census data?
- Do high-disagreement items signal contested values or measurement noise?
- What consistency tests could distinguish constructed from genuine preferences?
- Can proper scoring rules fix RLVR's degradation on disagreement prediction?
- Do bidirectional and any-order generation expose different parts of the joint distribution?
- When does the correlation between consistency and correctness break down?
- Why do models confabulate inconsistently across different samples?
- Can semantic entropy improve model calibration without external ground truth?
- Can proper scoring rules restore model calibration without sacrificing accuracy?
- What signals detect when consensus training is silently degrading performance?
- Why does regenerating LLM responses produce different but equally valid answers?
- Can we detect superposition in LLM personality traits and stated preferences?
- Can users experience the LLM Fallacy even when AI outputs are completely accurate?
- What happens when we treat LLM outputs as sampled rather than stored?
- What consumption data would validate the limited-consumption model in production systems?
- Why does self-consistency fail as a proxy reward for correctness?
- Can safety benchmarks detect reliability degradation from warmth training?
- Why does sophisticated measurement not validate the underlying scientific inference?
- Can exchange value persist without use value being verified first?
- Why do rare cases in medicine and science require models that preserve tail distributions?
- What happens when alignment targets measure only the preferred dimension of entangled properties?
- What other evaluation biases exist in LLM judge systems?
- What role does real-time accuracy feedback play in reducing user overreliance?
- How much noise comes from rater idiosyncrasy versus selection bias?
- What makes mathematically confident but incorrect answers resemble valid solution shapes?
- Why does preference measurement validity matter more than aggregation methods?
- What distinguishes research stages where the combined stack remains reliable?
- Can skill validation through testing prevent unreliable programs from accumulating?
- What makes out-of-band monitoring better than in-band verification loops?
- Can deterministic computation actually create new information in data?
- How does 93% reward reliability compare to other RL noise sources?
- Why does accumulated portfolio output not match accumulated worker capability?
- What makes a deployment paradigm credible for maintaining scientific integrity?
- Can models detect statistical properties of their own generation in real time?
- How does confidence in LLM outputs override users' ability to check accuracy?
- How should process quality and verification cost factor into evaluation judgment?
- Why does test accuracy improve after training accuracy reaches 100 percent?
- Can population-level distributions shift usefully even when individual prediction fails?
- How can distillation preserve uncertainty expression instead of optimizing it away?
- What makes self-consistency a sufficient training target for the judge role?
- Can imperfect uncertainty estimates still beat uniform oversight strategies?
- Can we systematically enumerate LLM failure modes from first principles?
- Can similar outputs from different systems prove they work the same way?
- Can experimental outcomes be reliably distilled into reusable insights?
- Can LLMs express uncertainty in ways that preserve epistemic honesty?
- Can aggregate survey realism coexist with unreliable fine-grained effects?
- How do local soundness signals work across different problem domains?
- Can crowdsourced voting and automated panels both credibly evaluate LLM outputs?
- What makes uncertainty calibration harder than expanding knowledge?
- Can contamination-free evaluation distinguish between memorization and genuine prediction ability?
- How do we measure marginal risk instead of speculating about misuse scenarios?
- What makes financial reasoning particularly vulnerable to general PRM failures?
- What specific bookkeeping tasks can environments maintain more reliably than policies?
- Why does externalized state beat parameter scaling for agent reliability?
- Why does preference measurement validity matter before any aggregation?
Related concepts in this collection (3)
- Does model confidence predict robustness to prompt changes?
  Explores whether a model's certainty about its answer determines how much it resists prompt rephrasing and semantic variation. This matters because it could explain why some tasks are harder to evaluate reliably.
  prompt sensitivity and sampling sensitivity are complementary reliability concerns
- Can LLM judges be fooled by fake credentials and formatting?
  Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
  judge unreliability compounds with exploitable biases
- Why do preference models favor surface features over substance?
  Preference models show systematic bias toward length, structure, jargon, sycophancy, and vagueness, features humans actively dislike. Understanding this 40% divergence reveals whether it stems from training data artifacts or architectural constraints.
  calibration failure at the preference model level adds to the reliability problem
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
- Can Large Reasoning Models Self-Train?
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings
- Can Machines Think Like Humans? A Behavioral Evaluation of LLM-Agents in Dictator Games
- DecepChain: Inducing Deceptive Reasoning in Large Language Models
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Original note title
deterministic LLM settings create fixed randomness not reliability — a single output remains one draw from the model's probability distribution