SYNTHESIS NOTE

Does setting temperature to zero actually make LLM outputs reliable?

Explores whether deterministic LLM settings that produce consistent outputs also guarantee reliable judgments, and how to measure true reliability beyond surface consistency.

Synthesis note · 2026-03-28 · sourced from Evaluations

"Can You Trust LLM Judgments?" (2024) introduces a rigorous framework for evaluating LLM-as-a-Judge reliability using McDonald's omega, revealing that the common practice of using fixed seeds and deterministic settings provides false confidence.

The core argument: even with deterministic settings, a single LLM output is one sample from the model's probability distribution. Setting temperature to zero and fixing the seed produces "fixed randomness": the same output every time, but that output may still be a misleading draw from the distribution. Consistent replication therefore does not guarantee reliability. Calibration does not settle the question either. A perfectly calibrated LLM that reports 90% confidence should be correct 9 times out of 10, yet it can still be unreliable if its output distribution has high variance across draws.
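A minimal Python sketch of the "fixed randomness" idea. It does not call any real model: `toy_judge` and the probabilities in `VERDICT_PROBS` are invented for illustration, standing in for a judge whose verdict is drawn from a distribution.

```python
import random
from collections import Counter

# Hypothetical verdict distribution for a single input (numbers are invented).
VERDICT_PROBS = {"pass": 0.55, "fail": 0.45}


def toy_judge(seed):
    """Draw one verdict from the toy distribution using a seeded generator."""
    rng = random.Random(seed)
    verdicts = list(VERDICT_PROBS)
    weights = list(VERDICT_PROBS.values())
    return rng.choices(verdicts, weights=weights, k=1)[0]


# Fixed seed: the same verdict on every call, which looks perfectly consistent.
fixed = [toy_judge(seed=42) for _ in range(5)]

# Varying the seed exposes the spread that the fixed seed was hiding.
spread = Counter(toy_judge(seed=s) for s in range(1000))

print("fixed seed:", fixed)
print("across seeds:", dict(spread))
```

The repeated calls with one seed agree with each other by construction, so their agreement carries no information about how evenly the verdicts split across draws.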

The framework: prompt the judgment LLM 100 times, varying only the replication while holding all other factors constant. Apply McDonald's omega to assess internal consistency across these replications. This reveals whether the model's judgments are stable properties of the input or artifacts of the sampling process.
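The replication-and-omega procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the judge is simulated (`simulate` is hypothetical, with each score equal to a latent quality plus replication noise), only 10 replications are used to keep the pure-Python run fast, and the one-factor model behind omega is fitted with a simple iterated principal-axis approximation. A real analysis would more likely use a dedicated factor-analysis package.

```python
import random


def covariance_matrix(scores):
    """scores[i][r] is the judgment for input i on replication r.
    Returns the k x k sample covariance between replications."""
    n, k = len(scores), len(scores[0])
    means = [sum(row[r] for row in scores) / n for r in range(k)]
    cov = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            c = sum((row[a] - means[a]) * (row[b] - means[b]) for row in scores) / (n - 1)
            cov[a][b] = cov[b][a] = c
    return cov


def top_eigenpair(m, iters=100):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    k = len(m)
    v = [k ** -0.5] * k
    for _ in range(iters):
        w = [sum(m[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(m[a][b] * v[b] for b in range(k)) for a in range(k)]
    return sum(x * y for x, y in zip(v, mv)), v


def mcdonalds_omega(scores, rounds=30):
    """Omega total under a one-factor model, fitted by iterated
    principal-axis factoring (an approximation, not a full ML fit)."""
    cov = covariance_matrix(scores)
    k = len(cov)
    communality = [cov[a][a] for a in range(k)]
    loadings = [0.0] * k
    for _ in range(rounds):
        reduced = [row[:] for row in cov]
        for a in range(k):
            reduced[a][a] = communality[a]
        eigenvalue, vector = top_eigenpair(reduced)
        loadings = [max(eigenvalue, 0.0) ** 0.5 * x for x in vector]
        # Cap communalities at the observed variance to avoid negative uniqueness.
        communality = [min(l * l, cov[a][a]) for a, l in enumerate(loadings)]
    common = sum(loadings) ** 2
    unique = sum(cov[a][a] - communality[a] for a in range(k))
    if common + unique == 0.0:
        raise ValueError("scores have no variance")
    return common / (common + unique)


def simulate(noise_sd, n_inputs=60, n_reps=10, seed=0):
    """Toy judge: each score is a latent quality plus replication noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_inputs):
        quality = rng.uniform(1.0, 5.0)
        rows.append([quality + rng.gauss(0.0, noise_sd) for _ in range(n_reps)])
    return rows


omega_stable = mcdonalds_omega(simulate(noise_sd=0.2))
omega_noisy = mcdonalds_omega(simulate(noise_sd=3.0))
print(f"stable judge omega: {omega_stable:.3f}")
print(f"noisy judge omega:  {omega_noisy:.3f}")
```

In this toy setup, omega is high when the scores mostly track the input and drops when replication noise dominates. A single deterministic run would not reveal that difference.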

The distinction between reliability, confidence, and calibration is critical:

Calibration: alignment between stated confidence and actual correctness
Confidence: the model's self-assessed certainty
Reliability: consistency of judgments across multiple draws

These three are intertwined but distinct. A model can be well-calibrated (its stated confidence matches how often it is right) but unreliable (it gives different answers on different draws). A model can be reliable (it always gives the same answer) but poorly calibrated (it states high confidence in a consistent answer that is wrong).
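The two contrasting cases can be illustrated with a toy simulation in Python. Both judges below are invented for this sketch, as are the confidence values and accuracy rates. Reliability is approximated by agreement between two draws on the same input, and calibration by a coarse single-bin gap between mean stated confidence and accuracy.

```python
import random

rng = random.Random(0)
N_INPUTS = 2000


def calibrated_but_unreliable(is_positive):
    """States 70% confidence and is right on about 70% of draws,
    but the verdict for a given input can flip between draws."""
    correct = rng.random() < 0.7
    verdict = is_positive if correct else not is_positive
    return verdict, 0.7


def reliable_but_miscalibrated(index, is_positive):
    """Always returns the same verdict for a given input and states 95%
    confidence, but that fixed verdict is right for only 60% of inputs."""
    correct = index % 5 < 3
    verdict = is_positive if correct else not is_positive
    return verdict, 0.95


def evaluate(judge_two_draws):
    """Returns (agreement between two draws, |mean confidence - accuracy|)."""
    agree = hits = 0
    conf_total = 0.0
    for i in range(N_INPUTS):
        truth = i % 2 == 0
        (v1, c1), (v2, c2) = judge_two_draws(i, truth)
        agree += v1 == v2
        hits += (v1 == truth) + (v2 == truth)
        conf_total += c1 + c2
    accuracy = hits / (2 * N_INPUTS)
    mean_conf = conf_total / (2 * N_INPUTS)
    return agree / N_INPUTS, abs(mean_conf - accuracy)


agree_a, gap_a = evaluate(
    lambda i, t: (calibrated_but_unreliable(t), calibrated_but_unreliable(t))
)
agree_b, gap_b = evaluate(
    lambda i, t: (reliable_but_miscalibrated(i, t), reliable_but_miscalibrated(i, t))
)
print(f"calibrated but unreliable: agreement={agree_a:.2f}, calibration gap={gap_a:.2f}")
print(f"reliable but miscalibrated: agreement={agree_b:.2f}, calibration gap={gap_b:.2f}")
```

The first judge has a small calibration gap but its two draws often disagree. The second agrees with itself on every input while its stated confidence overshoots its accuracy.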

This connects to "Does model confidence predict robustness to prompt changes?": ProSA measures sensitivity to prompt variation, while this framework measures sensitivity to sampling variation. Both show that a single evaluation is insufficient. The practical implication: any LLM-as-a-Judge deployment that relies on single-shot evaluation with deterministic settings offers the illusion of precision without evidence of reliability.
