DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents

Paper · arXiv 2303.17071 · Published March 30, 2023

Large language models (LLMs) have emerged as valuable tools for many natural language understanding tasks. In safety-critical applications such as healthcare, the utility of these models is governed by their ability to generate outputs that are factually accurate and complete. In this work, we present dialog-enabled resolving agents (DERA). DERA is a paradigm made possible by the increased conversational abilities of LLMs, namely GPT-4. It provides a simple, interpretable forum for models to communicate feedback and iteratively improve output. We frame our dialog as a discussion between two agent types: a Researcher, who processes information and identifies crucial problem components, and a Decider, who has the autonomy to integrate the Researcher’s information and make judgments on the final output. We test DERA against three clinically-focused tasks. For medical conversation summarization and care plan generation, DERA shows significant improvement over the base GPT-4 performance in both human expert preference evaluations and quantitative metrics. In a new finding, we also show that GPT-4’s performance (70%) on an open-ended version of the MedQA question-answering (QA) dataset (Jin et al.

Introduction. Large language models (LLMs; Brown et al. (2020); Lewis et al. (2020)) are deep-learning models that have been trained to predict natural language text conditioned on an input. The use of these models has led to advances in natural language performance far beyond just language modeling tasks. Within the realm of medicine, LLM-powered methods have shown improvements in medical tasks such as question answering (Singhal et al., 2022; Liévin et al., 2022), information extraction (Agrawal et al., 2022), and summarization (Chintagunta et al., 2021). LLM-powered methods use natural language instructions called prompts. These instruction sets often include a task definition, rules the predictions must follow, and optionally some examples of the task input and output (Reynolds and McDonell, 2021; Brown et al., 2020). The ability of generative language models to create output based on natural language instructions (or prompts) removes the need for task-specific training (Min et al., 2022) and allows non-experts to build upon this technology.
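As a rough illustration of the prompt structure described above (a task definition, rules, and optional input/output examples), the following sketch assembles those parts into a single instruction string. The names `PromptSpec` and `build_prompt`, and the exact text layout, are hypothetical choices made for this example; they are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PromptSpec:
    """Hypothetical container for the three prompt parts named in the text."""
    task_definition: str
    rules: List[str] = field(default_factory=list)
    # Optional few-shot examples as (input, output) pairs.
    examples: List[Tuple[str, str]] = field(default_factory=list)


def build_prompt(spec: PromptSpec, task_input: str) -> str:
    """Join the task definition, rules, examples, and new input into one prompt."""
    parts = [spec.task_definition.strip()]
    if spec.rules:
        parts.append("Rules:\n" + "\n".join(f"- {rule}" for rule in spec.rules))
    for i, (ex_in, ex_out) in enumerate(spec.examples, start=1):
        parts.append(f"Example {i}\nInput: {ex_in}\nOutput: {ex_out}")
    # The final block leaves the output open for the model to complete.
    parts.append(f"Input: {task_input}\nOutput:")
    return "\n\n".join(parts)
```

A call such as `build_prompt(PromptSpec("Summarize the conversation.", ["Do not add facts."]), dialog_text)` would yield a zero-shot prompt; supplying `examples` turns the same specification into a few-shot prompt without any task-specific training.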

Discussion / Conclusion. We introduce a framework for agent-agent dialog called DERA. This approach allows agents to focus on specific roles, reducing the need for an LLM to achieve the correct answer in one or two generations. In this setup, we use two types of agents: a Researcher, tasked with reviewing and selecting information, and a Decider, tasked with integrating that information into the final output. Both discuss the problem in a chat format, with the goal of improving the output of GPT-4. As shown in Sections 3 and 4, DERA improves the quality of the generated text across a variety of metrics. Importantly, DERA reduces the number of hallucinations and omissions in the resulting text. This finding is important given the ability of large language models (LLMs), in particular GPT-4, to generate text that is fluent but potentially prone to errors. The ability of DERA to identify and correct these hallucinations and omissions is critical when applying these models to real-world scenarios.
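The Researcher/Decider exchange described above can be sketched as a simple turn-taking loop. This is a minimal illustration under stated assumptions, not the paper's implementation: in DERA both agents are prompted GPT-4 instances, whereas here they are plain Python callables, and the function names, the `done` flag, and the `max_turns` cap are all hypothetical. The toy agents stand in for an omission check only.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]
# Hypothetical agent signatures (assumptions for this sketch).
Researcher = Callable[[str, str, Transcript], str]
Decider = Callable[[str, str, Transcript], Tuple[str, bool]]


def run_dialog(task_input: str, initial_output: str,
               researcher: Researcher, decider: Decider,
               max_turns: int = 3) -> Tuple[str, Transcript]:
    """Alternate Researcher suggestions and Decider revisions of a draft."""
    transcript: Transcript = []
    output = initial_output
    for _ in range(max_turns):
        suggestion = researcher(task_input, output, transcript)
        transcript.append(("Researcher", suggestion))
        # The Decider has the final say on whether and how to revise.
        output, done = decider(task_input, output, transcript)
        transcript.append(("Decider", output))
        if done:
            break
    return output, transcript


def toy_researcher(task_input: str, output: str, transcript: Transcript) -> str:
    """Toy stand-in: flag the first input sentence missing from the draft."""
    for sentence in task_input.split(". "):
        cleaned = sentence.strip(". ")
        if cleaned and cleaned not in output:
            return cleaned
    return ""  # Nothing left to flag.


def toy_decider(task_input: str, output: str,
                transcript: Transcript) -> Tuple[str, bool]:
    """Toy stand-in: append the flagged sentence, or stop if none was flagged."""
    suggestion = transcript[-1][1]
    if not suggestion:
        return output, True
    return (output + " " + suggestion + ".").strip(), False
```

With real models, each callable would wrap an LLM request carrying a role-specific prompt, and the returned transcript gives a readable record of which suggestions led to which revisions.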