Extreme Multi-Label Skill Extraction Training using Large Language Models
Abstract. Online job ads serve as a valuable source of information for skill requirements, playing a crucial role in labor market analysis and e-recruitment processes. Since such ads are typically formatted in free text, natural language processing (NLP) technologies are required to automatically process them. We specifically focus on the task of detecting skills (mentioned literally, or implicitly described) and linking them to a large skill ontology, making it a challenging case of extreme multi-label classification (XMLC). Given that no sizable labeled (training) dataset is available for this specific XMLC task, we propose techniques to leverage general Large Language Models (LLMs). We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction, and present a contrastive learning strategy that proves effective in the task. Our results across three skill extraction benchmarks show a consistent increase of between 15 and 25 percentage points in R-Precision@5 compared to previously published results that relied solely on distant supervision through literal matches.
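For readers unfamiliar with the metric, the following is a minimal sketch of R-Precision@K, assuming the definition commonly used in multi-label classification: the number of gold labels found among the top K predictions, divided by min(K, number of gold labels), averaged over examples. The function names and the toy labels are hypothetical and are not taken from the paper's evaluation code.

```python
def r_precision_at_k(ranked_labels, gold_labels, k=5):
    """R-Precision@K for one example.

    ranked_labels: predicted labels, best first (assumed to contain no duplicates).
    gold_labels:   the annotated labels for the example.
    """
    gold = set(gold_labels)
    if not gold:
        raise ValueError("at least one gold label is required")
    # Count gold labels that appear among the top-k predictions.
    hits = sum(1 for label in ranked_labels[:k] if label in gold)
    # Normalize by min(k, |gold|) so a perfect ranking always scores 1.0.
    return hits / min(k, len(gold))


def mean_r_precision_at_k(all_ranked, all_gold, k=5):
    """Average R-Precision@K over a collection of examples."""
    scores = [r_precision_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores)
```

Normalizing by min(K, number of gold labels) rather than by K keeps the score from being capped below 1 for sentences that carry fewer than K skills, which is the usual motivation for preferring this metric over plain Precision@K.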
Introduction. Job ads are published online on a daily basis. They contain valuable information about economic trends in the labor market, such as the evolution of skill demand over time. Given that vacant jobs are advertised in unstructured text, we need automatic information extraction methods, e.g., to extract the skills they mention. Such information extraction is crucial in labor market analysis and e-recruitment applications, including resume screening and job recommendation systems. Thus, it is unsurprising that the number of studies on skill extraction methods has increased tenfold in the last decade [6]. Several works have simplified the skill extraction problem to a pure detection task, limited to identifying the text span expressing a skill. Such a solution forgoes the normalization of synonyms and paraphrased skills toward an ontology of skill labels.
Discussion / Conclusion. This paper presents a cost-effective method for generating a comprehensive synthetic dataset of sentences, grounded in the ESCO ontology. The size of this dataset surpasses any previously annotated dataset for skill extraction and covers 99.5% of skills in ESCO. We demonstrate that a bi-encoder can be optimized using a contrastive training procedure to effectively represent both skill names and corresponding sentences in close proximity within the same space. This approach outperforms our distant supervision baseline by a large margin. Additionally, we propose a simple augmentation method that enhances the resulting model quality. We release the full dataset to foster future research in this area. It is crucial to carefully monitor and reduce any potential biases that might emerge from the data generation procedure. Bias evaluation and mitigation strategies need to be in place to make sure that the final model does not reinforce unjust or discriminatory outcomes. Finally, any development of a skill extraction method should keep in mind the final application that the extracted skills will serve. Only with respect to this application can fairness be defined and evaluated.
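The contrastive bi-encoder objective summarized above can be illustrated with a small sketch. This section does not specify the exact loss, temperature, or encoder, so the example assumes a common choice: a softmax cross-entropy over cosine similarities in which each sentence's paired skill name is the positive and the other skill names in the batch act as negatives. The function names, the temperature value, and the toy vectors are all hypothetical, and the embeddings are plain lists standing in for encoder outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def in_batch_contrastive_loss(sentence_vecs, skill_vecs, temperature=0.05):
    """Mean cross-entropy where sentence i should rank skill i above all other skills.

    sentence_vecs[i] and skill_vecs[i] are assumed to be a matching pair;
    every skill_vecs[j] with j != i is treated as a negative for sentence i.
    """
    sentences = [normalize(v) for v in sentence_vecs]
    skills = [normalize(v) for v in skill_vecs]
    total = 0.0
    for i, sentence in enumerate(sentences):
        logits = [dot(sentence, skill) / temperature for skill in skills]
        # Numerically stable log-sum-exp over the batch of candidate skills.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching skill.
        total += log_z - logits[i]
    return total / len(sentences)
```

Minimizing this quantity pulls each sentence embedding toward its own skill name and pushes it away from the other skill names in the batch, which is one way to obtain the shared space in which skill names and sentences lie close together. In practice the vectors would come from a trainable encoder and the loss would be computed with an automatic differentiation library.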