Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to accurately evaluate their quality. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model’s free-form generation is a challenge in its own right. To address this, many evaluations now use LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model such as GPT-4. While this method has grown in popularity, it is costly and has been shown to introduce intra-model bias; in this work, we also find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and six different datasets, we find that a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it is composed of disjoint model families, and is over seven times less expensive.
Introduction. Evaluating generative language models is a challenging task: not only is it difficult to find meaningful data to test the models, but evaluating the correctness of a generated response is itself a challenge. Multiple-choice datasets like MMLU (Hendrycks et al., 2020) have become popular in part by side-stepping the difficulty of evaluating generations. However, multiple-choice questions in many ways probe a different property than a free-form generative task does, and the latter is often closer to the downstream use case. Many automatic metrics have been used across various tasks, such as BLEU in machine translation (Papineni et al., 2002), ROUGE for summarization (Lin, 2004), and heuristic string-match methods, such as normalized exact match (EM) and token-level F1, for question answering (Rajpurkar et al., 2016). However, these simplistic methods commonly fail to analyze the intended property of interest. QA metrics, for example, invariably lead to both false positive failures (e.g.
Discussion / Conclusion. In this paper, we showed that a Panel of LLM Evaluators (PoLL) composed of smaller models is not only an effective method for evaluating LLM performance, but also reduces intra-model bias, latency, and cost. The benefits of PoLL are bolstered by the finding that there is no single ’best’ judge across all settings, while PoLL performs well consistently. In this work we investigated only three evaluator settings and a limited number of judges and panel compositions. While we showed that PoLL is an effective alternative to a single large model in these settings, further work is needed to see how broadly applicable the method is, for example in math or reasoning evaluations, where language models often struggle (Zheng et al., 2024). We also leave the task of ’panel selection’, that is, identifying the best models to include in PoLL in terms of quality and cost, to future work.
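To make the panel idea concrete, the following is a minimal sketch of how the individual verdicts of several judges could be pooled into a single decision. It is illustrative only: the text above does not specify the aggregation rule, so majority voting (for binary verdicts) and average pooling (for numeric scores) are assumptions here. The names `Judge`, `majority_vote`, `average_pool`, and `poll_verdict` are hypothetical, and the three stub judges are simple string heuristics standing in for real LLM calls.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence

# Hypothetical judge signature: (question, answer, reference) -> correct or not.
# In practice each judge would wrap a call to a different LLM.
Judge = Callable[[str, str, str], bool]


def majority_vote(verdicts: Sequence[bool]) -> bool:
    """Return True only if a strict majority of verdicts is True (ties -> False)."""
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def average_pool(scores: Sequence[float]) -> float:
    """Pool numeric scores (e.g. 1-5 ratings) by taking their mean."""
    return mean(scores)


def poll_verdict(judges: Sequence[Judge], question: str, answer: str, reference: str) -> bool:
    """Ask every judge on the panel and aggregate by majority vote (an assumption)."""
    return majority_vote([judge(question, answer, reference) for judge in judges])


# Stub judges for illustration only; they are NOT LLMs.
def exact_judge(question: str, answer: str, reference: str) -> bool:
    return answer.strip().lower() == reference.strip().lower()


def substring_judge(question: str, answer: str, reference: str) -> bool:
    return reference.strip().lower() in answer.lower()


def token_judge(question: str, answer: str, reference: str) -> bool:
    return set(reference.lower().split()) <= set(answer.lower().split())


PANEL = [exact_judge, substring_judge, token_judge]
```

With this sketch, `poll_verdict(PANEL, "What is the capital of France?", "The capital is Paris", "Paris")` accepts the answer because two of the three stub judges do, even though the strictest one rejects it. This is the sense in which a panel can smooth over the idiosyncrasies of any single judge.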