Linguistic Calibration of Long-Form Generations

Paper · arXiv 2404.00474 · Published March 30, 2024
Reinforcement Learning

Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as “I estimate a 30% chance of...” or “I am certain that...”, followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task.
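
The definition above can be made concrete with a small sketch. It assumes, since the text here does not specify it, that a reader's forecast is scored with a logarithmic scoring rule, and it uses expected calibration error (ECE) as one common way to quantify calibration. The function names, the example forecasts, and the binning scheme are hypothetical illustrations, not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple


def log_score_reward(forecast: Dict[str, float], true_answer: str,
                     eps: float = 1e-6) -> float:
    """Assumed reward: log-probability the reader's forecast assigns to the truth.

    `forecast` maps candidate answers to probabilities. The value is clipped at
    `eps` so that a forecast giving the truth zero probability stays finite.
    """
    p = forecast.get(true_answer, 0.0)
    return math.log(max(p, eps))


def expected_calibration_error(preds: List[Tuple[float, bool]],
                               n_bins: int = 10) -> float:
    """ECE over (confidence, was_correct) pairs using equal-width bins."""
    if not preds:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(b) / len(preds)) * abs(avg_conf - accuracy)
    return ece


# Hypothetical forecasts a reader might form after reading a generation.
hedged = {"Paris": 0.3, "Lyon": 0.7}   # text hedged, reader stays uncertain
overconfident = {"Lyon": 1.0}          # text asserted a wrong claim as certain

print(log_score_reward(hedged, "Paris"))
print(log_score_reward(overconfident, "Paris"))
print(expected_calibration_error([(0.8, True)] * 8 + [(0.8, False)] * 2))
```

The logarithmic score is a strictly proper scoring rule, so a reader maximizes expected reward by reporting their true beliefs. Under this assumption, a generation that leads the reader to a hedged forecast is penalized far less than one that leads to a confident error.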

Introduction. The claims made by language models (LMs) are increasingly used to inform real-world decisions, e.g., what to order at a restaurant, what information to provide someone else about a topic, or which code completion to accept. However, LMs have knowledge gaps which manifest as hallucinations (Huang et al., 2023; Ji et al., 2023). Currently, when an LM lacks knowledge about a topic, it will do one of two things: hallucinate incorrect claims with complete confidence, or, in the case of a few strong closed-source models (Anthropic, 2023; OpenAI et al., 2023), abstain from making claims. Confident hallucinations are especially harmful. They decrease users’ trust in the errant LM and broadly make LMs unsuitable for settings where factuality is paramount, such as medicine (Thirunavukarasu et al., 2023) and law (Dahl et al., 2024). Perhaps most importantly, they lead the user to confidently make poor decisions (Fig. 1). However, even abstentions are suboptimal, because they give the user neither plausible claims nor the likelihoods of those claims.

Discussion / Conclusion. Limitations and future work. Our linguistically calibrated LM generalizes well from surrogate to crowdworker forecasts. However, many of the confidence statements it emits are fairly unambiguous, e.g., percentages. Therefore, future work could investigate how closely LM and human interpretations of ambiguous linguistic confidence statements match, which could enable training LMs to emit linguistic confidence statements tailored to specific user populations. Additionally, we use off-the-shelf question-answering (QA) datasets as a proxy for questions encountered during real-world decision-making. To improve the generalization of linguistic calibration (LC) to decision-making scenarios in the wild, future work could curate a more representative QA dataset. Lastly, we work in a white-box setting where finetuning LMs is possible; our training framework could not be used to calibrate API-based LMs that only provide access to completions.