Self-critiquing models for assisting human evaluators
We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model-written and human-written summaries, as well as intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques and, on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly.
Introduction. With increasingly capable language models, it is important to ensure models are trustworthy on difficult and high-stakes tasks. For example, models are being used to write complex pieces of code [CTJ+21, LCC+22] and answer open-ended questions about the world [NHB+21, MTM+22]. We would like to be able to train models that don’t write buggy code or spread misinformation. However, fully evaluating the correctness of code or the veracity of facts about the world requires a lot of effort and expertise. Techniques to train systems from human feedback [NR+00, Wes16, CLB+17, JMD20, NMS+21, SCC+22] fundamentally depend on humans’ ability to demonstrate and evaluate the quality of model outputs. This leads to the problem of scalable oversight [AOS+16]: How can we effectively provide feedback to models on tasks that are difficult for humans to evaluate? One idea to overcome this problem is to use AI systems to aid human evaluation. This basic idea comes up in many prior proposals, such as iterated amplification [CSA18], debate [ICA18], and recursive reward modeling [LKE+18].
Discussion / Conclusion. We view our results as a proof of concept for feedback assistance as a solution to the problem of scalable oversight: Even though topic-based summarization isn’t actually a hard task for human labelers, in our experiments we still see significant gains from AI assistance in the form of critiques.
1. Large language models are already capable enough to meaningfully assist human evaluation and the scaling trend in Figure 4 suggests that larger models may improve at assisting in evaluating their own outputs. The publicly available InstructGPT models are capable of critiquing well few-shot and even zero-shot (Figure 3). Overall, we believe there is potential to do empirical experiments for scalable oversight with today’s models, using schemes similar to reward modeling [LKE+18] or debate [IA19].
3. Learning from natural language feedback is feasible now.