Distilling LLMs' Decomposition Abilities into Compact Language Models

Paper · arXiv 2402.01812 · Published February 2, 2024
Training and Fine-Tuning

Large Language Models (LLMs) have demonstrated strong reasoning abilities, yet their large size presents scalability challenges and limits further customization. In contrast, compact models are easier to customize through training but often fall short in solving complex reasoning tasks. This study focuses on distilling LLMs’ decomposition skills into compact models using offline reinforcement learning. We leverage advances in LLM capabilities to provide feedback and generate a specialized task-specific dataset for training compact models. The development of an AI-generated dataset and the establishment of baselines constitute the primary contributions of our work, underscoring the potential of compact models in replicating complex problem-solving skills.

Introduction. Recent strides in Natural Language Processing (NLP) have brought forth powerful Large Language Models (LLMs) like GPT-4 (OpenAI, 2023), Claude 2, or Gemini (Team et al., 2023). These models not only excel at straightforward tasks such as summarization and sentiment analysis but, with adept prompting, also handle reasoning tasks that demand mathematical and logical abilities (Huang & Chang, 2022). Notably, Chain-of-Thought (CoT) prompting (Wei et al., 2022) and its variations (Kojima et al., 2022; Wang et al., 2022) have proven to be promising and relatively simple techniques for enhancing LLMs’ reasoning capabilities. Within the realm of complex reasoning, the ability to decompose intricate questions into a set of simpler sub-questions represents a crucial and understudied component (Shridhar et al., 2022). Existing works predominantly focus on end-to-end solutions for reasoning (Zhou et al., 2022; Lyu et al., 2023), leaving the decomposition step itself with limited attention.
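To make the notion of decomposition concrete, the following is a minimal sketch of how one example pairing a question with its sub-questions and an LLM feedback score might be represented, and how low-scoring decompositions could be dropped before training a compact model. The record fields, the score range, the threshold, and the sample question are all hypothetical. The filtering step is a simple reward-based selection used here only for illustration, and is not claimed to be the offline reinforcement learning procedure used in this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecompositionExample:
    """One hypothetical dataset record (field names are assumptions)."""
    question: str             # the original complex question
    sub_questions: List[str]  # simpler questions proposed as a decomposition
    feedback: float           # assumed LLM-assigned quality score in [0, 1]


def select_for_training(
    examples: List[DecompositionExample], min_feedback: float = 0.5
) -> List[DecompositionExample]:
    """Keep only decompositions whose feedback score meets the threshold."""
    return [ex for ex in examples if ex.feedback >= min_feedback]


# Illustrative question, not taken from the paper's dataset.
QUESTION = (
    "Pens cost 3 dollars each. Tom buys 4 pens and pays with a "
    "20 dollar bill. How much change does he get?"
)

examples = [
    DecompositionExample(
        question=QUESTION,
        sub_questions=[
            "How much do 4 pens cost in total?",
            "How much is left from 20 dollars after paying that amount?",
        ],
        feedback=0.9,  # useful decomposition, high assumed score
    ),
    DecompositionExample(
        question=QUESTION,
        sub_questions=["What colour are the pens?"],
        feedback=0.1,  # irrelevant sub-question, low assumed score
    ),
]

kept = select_for_training(examples)
```

Under these assumptions, `kept` contains only the first record, whose sub-questions lead step by step to the answer.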

Discussion / Conclusion. This work introduces a novel AI-generated benchmark tailored for evaluating sub-questioning in reasoning tasks. We employ diverse offline learning approaches, vary model sizes for the baselines, and assess performance using different LLMs. Our experiments aim to shed light on the challenges and potential avenues for enhancing reasoning capabilities. The outcomes reveal a significant performance gap between the best-performing approach and ChatGPT. The underwhelming performance of the offline RL approach underscores the need for further advancements in this domain, presenting an opportunity for future research to explore and refine these methodologies. By providing this benchmark, we aspire to catalyze research on sub-questioning. We anticipate that the dataset curated in this work will serve as a foundational resource for assessing the reasoning capabilities of emerging offline RL approaches in the field of NLP.