Long-form Factuality In Large Language models

Paper · arXiv 2403.18802 · Published March 27, 2024
LLM Failure Modes

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model’s long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for longform factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user’s preferred response length (recall).

Introduction. Large Language Models (LLMs) have significantly improved in recent years (Brown et al., 2020; Chowdhery et al., 2022; Google, 2023; OpenAI, 2023; Gemini Team, 2023, inter alia) but still lack reliability in responding to in-depth factuality questions. In particular, they often produce factual errors in which a claim contradicts established ground-truth knowledge (Huang et al., 2023; Zhang et al., 2023, inter-alia).1 For example, models may respond with incorrect information about established facts such as dates, statistics, or even a celebrity’s occupation (Li et al., 2023; Min et al., 2023; Muhlgay et al., 2023). These factual errors weaken a language model’s factuality, making the model unreliable in many real-world settings where a factually-accurate response is expected. In this paper, we propose a new prompt set called LongFact, an evaluation method named SAFE, and a metric (F1@K) for quantifying the long-form factuality of a long-form response. We then extensively benchmark popular large language models using these new datasets and evaluation methods.

Discussion / Conclusion. In this paper, we examined how to thoroughly benchmark long-form factuality in large language models. To do so, we first used GPT-4 to generate LongFact, a set of 2,280 prompts spanning 38 different topics. We then proposed using language-model agents to automatically evaluate the longform factuality of a model’s response. This method, which we call SAFE, utilizes a search-enabled large language model to split a long-form response into individual facts, revise individual facts to be self-contained, determine the relevance of each individual fact to answering the prompt, and check the factuality of each relevant fact by issuing Google Search queries. Furthermore, we extended recall into the long-form domain by utilizing a hyperparameter K as a proxy for human-preferred length of responses, and we combined this measurement of recall with precision in F1@K. Empirically, we demonstrated that SAFE outperforms crowdsourced human annotators by agreeing with 72% of human annotations and winning 76% of examples out of a set of 100 randomly-sampled disagreement cases.