TarGEN: Targeted Data Generation with Large Language Models

Paper · arXiv 2310.17876 · Published October 27, 2023
Training Data · Reinforcement Learning

We present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets using LLMs. An advantage of TarGEN is its seedless nature; it does not require specific task instances, broadening its applicability beyond task replication. This differentiates it from other data generation techniques, as it can be leveraged for novel or highly domain-specific tasks with no existing data instances. We augment TarGEN with a self-correction module that enables LLMs to rectify inaccurately labeled instances during dataset creation, ensuring reliable labels. To assess our technique’s effectiveness against existing baselines, we emulate eight tasks from the SuperGLUE benchmark to create a "synthetic" version and finetune various language models on both synthetic and original training sets.
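The seedless, multi-step idea with self-correction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the split into three steps (contexts, label-conditioned instances, label re-check), the prompt wording, and the names `generate_dataset`, `toy_llm`, and the `LLM` callable type are all assumptions introduced here. The `toy_llm` function is a deterministic stand-in so the sketch runs without a real model.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def generate_dataset(llm: LLM, task: str, labels: List[str],
                     n_contexts: int = 2) -> List[Dict[str, str]]:
    """Seedless generation: only a task description and a label set are given."""
    # Step 1 (assumed): ask for diverse contexts instead of supplying task exemplars.
    reply = llm(f"List {n_contexts} distinct domains, one per line, for the task: {task}")
    contexts = [line.strip() for line in reply.splitlines() if line.strip()][:n_contexts]

    dataset = []
    for context in contexts:
        for label in labels:
            # Step 2 (assumed): generate an instance conditioned on context and target label.
            text = llm(
                f"Task: {task}\nDomain: {context}\n"
                f"Write one input whose correct label is '{label}'."
            ).strip()
            # Step 3: self-correction. Ask the model to re-check the label and
            # overwrite it only when the reply is a valid label.
            verdict = llm(
                f"Task: {task}\nInput: {text}\nProposed label: {label}\n"
                f"Reply with the correct label only, one of: {', '.join(labels)}."
            ).strip()
            final_label = verdict if verdict in labels else label
            dataset.append({"context": context, "text": text, "label": final_label})
    return dataset


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, so the sketch runs offline."""
    if prompt.startswith("List"):
        return "cooking\nastronomy"
    if "Proposed label:" in prompt:
        # Pretend the model judges every astronomy input to be 'negative';
        # otherwise it agrees with the proposed label.
        proposed = prompt.split("Proposed label: ")[1].splitlines()[0]
        return "negative" if "astronomy" in prompt else proposed
    domain = prompt.split("Domain: ")[1].splitlines()[0]
    return f"A short review about {domain}."


if __name__ == "__main__":
    for row in generate_dataset(toy_llm, "Classify the sentiment of a review",
                                ["positive", "negative"]):
        print(row)
```

In practice `toy_llm` would be replaced by a call to an actual model. The point of the sketch is that no step takes existing task instances as input, and that the label assigned at generation time is treated as provisional until the re-check.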

Introduction. Large Language Models (LLMs) like ChatGPT, Llama (Touvron et al., 2023a;c), and Mistral (Jiang et al., 2023) have shown impressive results across a wide range of tasks (OpenAI, 2023; Brown et al., 2020). As LLM capabilities advance, the tools for testing the extent of these capabilities become insufficient (Liu et al., 2022b; He et al., 2023; Valmeekam et al., 2022; Chen et al., 2021). This is particularly true for domain-specific datasets, as creating expertly curated evaluation benchmarks is time- and labor-intensive (Clark et al., 2018; Suzgun et al., 2022; Wang et al., 2022). Several synthetic dataset creation methods, such as Self-Instruct (Wang et al., 2023), AttrPrompt (Yu et al., 2023), and ZeroGen (Ye et al., 2022a), have been proposed, primarily for text classification tasks. These approaches employ in-context learning to generate synthetic data points that resemble their prompt exemplars, which inherently constrains their ability to produce diverse examples.

Discussion / Conclusion. In this work, we introduced TarGEN, a multi-step prompting strategy for generating high-quality and diverse synthetic datasets with LLMs without any human supervision. We described a step-by-step methodology for TarGEN to synthesize a dataset from instructions without any task exemplars. To evaluate the framework, we emulated eight tasks from the SuperGLUE benchmark and compared the resulting synthetic version with the original SuperGLUE by finetuning different families of models on each. Experimental results show that models finetuned on our synthetic SuperGLUE outperform models finetuned on the original SuperGLUE. A comprehensive analysis of the synthetic benchmark against the original benchmark yielded several findings: data instances in our synthesized benchmark are more difficult and diverse than those in the original, and they exhibit similar dataset bias. Further comparison with Self-Instruct and AttrPrompt showed that synthetic SuperGLUE served as a better pre-finetuning corpus when evaluated on the OpenLLM benchmark, yielding strong performance with T5-3B and Llama2-7B.