The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
The “LLM-as-a-judge” paradigm employs Large Language Models (LLMs) as annotators and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans, with closed-source LLMs (such as GPT-4o) outperforming open-source LLMs, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
Introduction. The rise of Large Language Models (LLMs) has transformed the field of Natural Language Processing (NLP), bringing unprecedented capabilities in reasoning and generating human-like text (Kojima et al., 2022; Achiam et al., 2023; Laskar et al., 2023; Yang et al., 2024). Recently, a new trend has emerged where LLMs are employed as annotators and judges across various NLP applications (Li et al., 2024a; Tan et al., 2024b). One key advantage of LLM-as-a-judge is the scalability and speed of LLMs. They can quickly annotate large-scale datasets, reducing the time required for tasks traditionally performed by costly human annotators (Nasution and Onan, 2024). Additionally, LLMs avoid challenges inherent to human factors, like fatigue and guideline misinterpretation (Uma et al., 2021; Bartsch et al., 2023). Unsurprisingly, they sometimes outperform crowdworkers (Gilardi et al., 2023; Nahum et al., 2024). Indeed, LLMs-as-judges are extensively used in research, taking on a pivotal role once filled by human annotators.
Discussion / Conclusion. Science advances through systematic observation, precise measurement, and the rigorous validation of hypotheses. It is no coincidence that Pearson famously claimed statistics to be “the grammar of science”. As results and findings of studies increasingly rely on LLMs instead of human annotators, extra care is needed to uphold scientific rigor. In this paper, we proposed a statistical procedure to justify using LLM annotations in research studies, the alt-test, which is simple and requires minimal effort. As demonstrated in our analysis, researchers can recruit a small group of annotators (at least three) to annotate a subset of 50 to 100 examples, depending on the complexity of the task. If the LLM achieves a winning rate above 0.5, its annotations can be confidently relied upon. Our findings also suggest using few-shot prompting, as this technique enhances LLM performance.
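To make the workflow above concrete, the following is a minimal sketch of how a winning rate might be computed from a small annotated subset. It is not the exact alt-test: the details are assumptions for illustration only. Each human annotator is held out in turn, the LLM and the held-out annotator are scored by their per-item agreement with the remaining annotators, and a one-sided test (normal approximation) checks whether the LLM is not worse by more than a tolerance. The function name `winning_rate`, the tolerance `epsilon`, the significance level `alpha`, and the exact-match agreement score are all hypothetical, and a full procedure would likely also correct for multiple comparisons across annotators, which is omitted here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def winning_rate(human, llm, epsilon=0.1, alpha=0.05):
    """Fraction of human annotators the LLM "beats" (illustrative sketch).

    human[j][i] is annotator j's label for item i; llm[i] is the LLM's label.
    epsilon and alpha are hypothetical parameters, not taken from the text.
    """
    n_items = len(llm)
    wins = 0
    for j, held_out in enumerate(human):
        others = [labels for k, labels in enumerate(human) if k != j]
        # Per-item difference in agreement with the remaining annotators.
        diffs = []
        for i in range(n_items):
            llm_score = mean(float(llm[i] == o[i]) for o in others)
            human_score = mean(float(held_out[i] == o[i]) for o in others)
            diffs.append(llm_score - human_score)
        # One-sided test of H0: mean(diff) <= -epsilon (normal approximation).
        shifted = mean(diffs) + epsilon
        spread = stdev(diffs)
        if spread == 0.0:
            p_value = 0.0 if shifted > 0 else 1.0
        else:
            z = shifted / (spread / sqrt(n_items))
            p_value = 1.0 - NormalDist().cdf(z)
        if p_value < alpha:
            wins += 1
    return wins / len(human)


# Toy data: three annotators, 60 binary items, one annotator is noisy.
truth = [i % 2 for i in range(60)]
noisy = [1 - t if i % 3 == 0 else t for i, t in enumerate(truth)]
humans = [truth, truth, noisy]
rate = winning_rate(humans, llm=truth)
print(rate, rate > 0.5)
```

With real data, `human` would hold the labels from the three or more recruited annotators on the 50 to 100 sampled examples, and the resulting rate would be compared against the 0.5 threshold mentioned above.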