KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Paper · arXiv 2306.09296 · Published June 15, 2023
LLM Evaluations and Benchmarks

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate 21 open-source and commercial LLMs and obtain some intriguing findings.
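To make the idea of "standard scores for better numerical comparability across tasks and models" concrete, here is a minimal sketch. The abstract does not give the formula, so this assumes a plain per-task z-score across models followed by min-max rescaling to a 0-100 range; the function names and the model names in the usage note are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def standard_scores(raw):
    """Convert one task's raw scores (model name -> score) to z-scores.

    Assumption: standardization is done per task, across all evaluated models.
    """
    values = list(raw.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All models scored the same; no model stands out on this task.
        return {model: 0.0 for model in raw}
    return {model: (score - mu) / sigma for model, score in raw.items()}


def rescale_0_100(z):
    """Min-max rescale z-scores to [0, 100] (an assumed presentation choice)."""
    lo, hi = min(z.values()), max(z.values())
    if hi == lo:
        return {model: 50.0 for model in z}
    return {model: 100.0 * (v - lo) / (hi - lo) for model, v in z.items()}
```

For example, calling `rescale_0_100(standard_scores({"model_a": 0.2, "model_b": 0.4, "model_c": 0.6}))` places the three hypothetical models on a shared scale, so a task scored with F1 and a task scored with ROUGE can be averaged without one metric's range dominating the other.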

Introduction. Recent remarkable breakthroughs achieved by large language models (LLMs) like GPT-4 [1] have elicited widespread astonishment. Considering the extensive and profound natural language understanding and generation abilities exhibited by LLMs [2], conventional benchmarks [3, 4] that focus on relatively narrow and superficial abilities are no longer as helpful for testing them. It has become necessary to construct better benchmarks for effectively comparing LLMs and providing valuable diagnostic results. To this end, various benchmarks have been proposed, focusing on extending the evaluation scope to cover broader abilities [5, 6, 7] or more challenging tasks [8, 9].

Discussion / Conclusion. This paper presents KoLA, a carefully designed Knowledge-oriented LLM Assessment benchmark. We design a cognitive ability taxonomy for more helpful diagnostic results, adopt both known and evolving data sources for better fairness, and employ contrastive metrics for high applicability. In the