Perturbation CheckLists for Evaluating NLG Evaluation Metrics

Paper · arXiv 2109.05771 · Published September 13, 2021
LLM Evaluations and Benchmarks

A questionnaire with text on it

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criteria (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose CheckLists for better design and evaluation of automatic metrics. We design templates which target a specific criteria (e.g., coverage) and perturb the output such that the quality gets affected only along this specific criteria (e.g., the coverage drops).

Introduction. As the number of tasks and benchmarks for NLG have increased (Gehrmann et al., 2021), the challenges in evaluating NLG systems have also continued to grow (Liu et al., 2016; Nema and Khapra, 2018; Sai et al., 2019). One reliable way of evaluating NLG systems is to collect human judgements. However, this is a time consuming and expensive process (Freitag et al., 2021; Deriu et al., 2019; Howcroft et al., 2020). Hence, automatic evaluation metrics such as BLEU (Papineni et al., 2002) which are quicker to compute have become popular, despite being less reliable (Callison-Burch et al., 2006; Reiter, 2018). The survey by Sai et al. (2020b) shows that more than 35 automatic evaluation metrics have been proposed for NLG since 2014, however, there is no careful evaluation of the ability of such metrics to assess the quality of the output of an NLG system on multiple desired criteria. For example, consider the task of dialog evaluation, where humans are asked to score the output on multiple criteria such as fluency, adequacy, coherence, informativeness, engagingness, consistency, etc.

Discussion / Conclusion. proposed metrics such as DEB (Sai et al., 2020a), BLANC (Vasilyev et al., 2020) and MaUde (Sinha et al., 2020) do not correlate well with human judgements on other criteria. This is despite the fact that these are task-specific metrics which use the modern machinery of pre-trained BERT-based models and are fine-tuned on human judgements for overall quality. This vindicates our stand that simply tuning for overall quality does not lead to good correlations with other criteria. We do observe a few decently correlated metrics for some of the tasks along a few criteria.