On The Persona-based Summarization of Domain-Specific Documents

Paper · arXiv 2406.03986 · Published June 6, 2024

In an ever-expanding world of domain-specific knowledge, the increasing complexity of consuming and storing information necessitates the generation of summaries from large information repositories. However, each persona within a domain has different information requirements and therefore needs a different summary. For example, in the healthcare domain, a persona-based approach (for personas such as Doctor, Nurse, and Patient) is imperative to deliver targeted medical information efficiently. Persona-based summarization of domain-specific information by humans is a high cognitive load task and is generally not preferred. Summaries written by two different humans vary widely, and human summarization does not scale in cost or subject matter expertise as domains and personas grow. Further, AI-generated summaries from generic Large Language Models (LLMs) may not offer satisfactory accuracy across domains unless the models have been trained on domain-specific data, and such models can also be very expensive to use in day-to-day operations. Our contribution in this paper is two-fold: we present an approach to efficiently fine-tune a domain-specific small foundation LLM using a healthcare corpus, and we show that we can effectively evaluate the summarization quality using AI-based critiquing.

Introduction. In the rapidly expanding digital world, the exponential growth of domain-specific knowledge has posed unprecedented challenges in efficiently storing and consuming vast information repositories. As managing such information grows more complex, generating precise and specific summaries becomes important. This need is particularly evident for domain-specific data, because a domain contains different personas whose differing information requirements should be reflected in the generated summaries. For instance, the healthcare domain includes diverse personas, ranging from healthcare professionals such as doctors and nurses to patients, who require targeted information customized to their specific roles and comprehension levels. Traditional generic approaches to summarization have often relied on humans to perform this high cognitive load task.

Discussion / Conclusion. In this paper, we propose a framework for the efficient training of a small foundation LLM on AI-generated datasets to obtain high-quality, domain-specific, persona-based summaries. Our focus is on fine-tuning Llama2 on a healthcare corpus so that the trained model captures the intricacies of persona-based summaries in the healthcare domain. We also demonstrate the effectiveness of using AI-based critiquing to evaluate the model-generated summaries, which provides a more automated and scalable solution. Our experiments reveal that the persona-based summaries generated by our fine-tuned model are of higher quality than those of contemporary baselines. Further, AI-based critiquing of the summaries shows high inter-annotator agreement with human-based critiquing, confirming the effectiveness of our proposed approach.
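
As an illustration of how agreement between AI-based and human-based critiquing could be quantified, the following is a minimal sketch that computes Cohen's kappa over two lists of ratings. The excerpt above does not state which agreement statistic the authors used, so the choice of Cohen's kappa is an assumption, and the function name `cohen_kappa` and the rating values are hypothetical, made up purely for illustration.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the paper excerpt does not name its agreement metric;
    kappa is used here only as one common choice.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")

    n = len(ratings_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 quality scores for eight summaries (not data from the paper).
ai_scores = [5, 4, 4, 3, 5, 2, 4, 5]
human_scores = [5, 4, 3, 3, 5, 2, 4, 4]

kappa = cohen_kappa(ai_scores, human_scores)
print(f"Cohen's kappa between AI and human critiques: {kappa:.3f}")
```

Kappa corrects raw agreement for agreement expected by chance, which matters when most summaries receive similar scores. For ordinal scores such as these, a weighted variant of kappa might be more appropriate, since it penalizes near-misses less than large disagreements.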
