A Survey on Knowledge Distillation of Large Language Models

Paper · arXiv 2402.13116 · Published February 20, 2024

Abstract. In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD’s role within the realm of LLMs, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars (algorithm, skill, and verticalization), providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs’ performance.

Introduction. In the evolving landscape of artificial intelligence (AI), proprietary Large Language Models (LLMs) such as GPT-3.5 (Ouyang et al., 2022), GPT-4 (OpenAI et al., 2023), Gemini (Team et al., 2023) and Claude have emerged as groundbreaking technologies, reshaping our understanding of natural language processing (NLP). These models, characterized by their vast scale and complexity, have unlocked new realms of possibility, from generating human-like text to offering sophisticated problem-solving capabilities. The core significance of these LLMs lies in their emergent abilities (Wei et al., 2022a,b; Xu et al., 2024a), a phenomenon where the models display capabilities beyond their explicit training objectives, enabling them to tackle a diverse array of tasks with remarkable proficiency. Their deep understanding of context, nuance, and the intricacies of human language enables them to excel in a wide array of applications, from creative content generation to complex problem-solving (OpenAI et al., 2023; Liang et al., 2022).

Discussion / Conclusion. This survey has traversed the expansive domain of knowledge distillation applied to LLMs, shedding light on the myriad techniques, applications, and emerging challenges in this vibrant field. We have underscored the pivotal role of KD in democratizing access to the advanced capabilities of proprietary LLMs, thereby fostering a more equitable AI landscape. Through meticulous examination, we have highlighted how KD serves as a bridge, enabling resource-constrained entities to benefit from the profound advancements in LLMs without the prohibitive costs associated with training and deploying state-of-the-art models. Our exploration delineates the multifaceted approaches to KD, ranging from algorithmic innovations and skill enhancement to domain-specific distillations. Each segment reveals the nuanced complexities and potentialities inherent in tailoring distilled models to emulate the sophisticated understandings and functionalities of their more cumbersome counterparts.
