Can LLM be a Personalized Judge?
Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing interest in LLM personalization within the research community. However, current works often rely on the LLM-as-a-Judge approach for evaluation without thoroughly examining its validity. In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, that is, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. Through human evaluation, we find that LLM-as-a-Personalized-Judge achieves performance comparable to third-party human evaluation and even surpasses human performance on high-certainty samples.
Introduction. As large language models (LLMs) gain widespread adoption among global users with diverse backgrounds, it is imperative to ensure that these models are designed to reflect their values and preferences (Sorensen et al., 2024; Kirk et al., 2024). However, the current alignment process often assumes a homogeneous set of human preferences and ignores individual perspectives, even in context-dependent, subjective tasks (Santurkar et al., 2023). Therefore, efforts have been made to fine-tune LLMs to encode individual preferences or enhance role-playing capabilities (Jang et al., 2023; Shao et al., 2023; Occhipinti et al., 2023; Li et al., 2024a; Andukuri et al., 2024), with “LLM-as-a-Judge” as the main evaluation metric (Zheng et al., 2023), often without adequate validation. Although “LLM-as-a-Judge” shows high agreement with human annotators in many tasks, its effectiveness for personalization tasks remains largely unscrutinized.
Discussion / Conclusion. In this paper, we formalized and examined the validity of LLM-as-a-Personalized-Judge. Contrary to previous assumptions, we demonstrated that the standard LLM-as-a-Judge setting is not sufficiently reliable for personalization tasks, showing low agreement with human ground truth. We identified persona sparsity as a major cause of this unreliability. We then introduced verbal certainty estimation and found that powerful LLMs (e.g., GPT-4) are capable of effectively assessing the certainty of their own responses. This led to the observation that high-certainty samples indeed exhibit high accuracy (above 80% on binary tasks). We additionally conducted a human annotation experiment and found that LLM-as-a-Personalized-Judge achieves accuracy comparable to that of third-party human judges and surpasses humans on high-certainty samples.
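To make the high-certainty analysis concrete, the following is a minimal sketch of how agreement with human ground truth could be computed before and after filtering by self-reported certainty. It is an illustration under stated assumptions, not our actual pipeline: the `Judgment` record, the 1-100 certainty scale, the threshold of 80, and the toy samples are all hypothetical and do not correspond to any data or result reported in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Judgment:
    """One binary preference judgment (hypothetical record layout)."""
    predicted: str     # option the LLM judge chose for the persona, e.g. "A" or "B"
    certainty: int     # certainty verbalized by the judge (assumed 1-100 scale)
    ground_truth: str  # option the real user actually preferred


def agreement(judgments: List[Judgment], min_certainty: int = 0) -> Optional[float]:
    """Fraction of judgments matching ground truth among those whose
    certainty is at least `min_certainty`; None if no judgment qualifies."""
    kept = [j for j in judgments if j.certainty >= min_certainty]
    if not kept:
        return None
    return sum(j.predicted == j.ground_truth for j in kept) / len(kept)


# Made-up toy samples, for illustration only (not results from the paper).
samples = [
    Judgment("A", 90, "A"),
    Judgment("B", 85, "B"),
    Judgment("A", 40, "B"),
    Judgment("B", 30, "B"),
    Judgment("A", 95, "A"),
    Judgment("B", 20, "A"),
]

overall = agreement(samples)                          # all samples
high_certainty = agreement(samples, min_certainty=80) # high-certainty subset only
print(f"overall agreement: {overall:.2f}")
print(f"high-certainty agreement: {high_certainty:.2f}")
```

Filtering trades coverage for reliability: the stricter the certainty threshold, the fewer samples remain, so any reported high-certainty agreement should be read together with the fraction of samples retained.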