A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. The evaluation draws on 23 data sets covering 8 common NLP application tasks. We extensively evaluate the multitask, multilingual, and multimodal aspects of ChatGPT based on these data sets and a newly designed multimodal data set. We find that ChatGPT outperforms other LLMs in the zero-shot setting on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than at generating them. It is able to generate multimodal content from textual prompts via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average across 10 reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, which makes it an unreliable reasoner. Like other LLMs, ChatGPT suffers from hallucination problems. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., by 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We release code for evaluation set extraction.
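To make the ROUGE-1 figure above concrete, the following is a minimal sketch of how a unigram-overlap F1 score and a turn-to-turn gain could be computed. It is an illustration under simplifying assumptions (lowercased whitespace tokenization, no stemming), not the evaluation code used in the paper; the function names `rouge1_f1` and `rouge1_gain` are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate text.

    Assumption: tokens are lowercased, whitespace-separated words.
    Standard ROUGE toolkits apply additional normalization.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a unigram counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def rouge1_gain(reference: str, first_turn: str, refined_turn: str) -> float:
    """Change in ROUGE-1 F1 between an initial response and a refined one."""
    return rouge1_f1(reference, refined_turn) - rouge1_f1(reference, first_turn)
```

A positive value from `rouge1_gain` indicates that the response produced after a follow-up prompt overlaps more with the reference than the first response did.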
Introduction. ChatGPT is a successor of the large language model (LLM) InstructGPT (Ouyang et al., 2022) with a dialog interface that is fine-tuned using the Reinforcement Learning with Human Feedback (RLHF) (Christiano et al., 2017) approach. ChatGPT has gathered 100 million monthly active users in a short period of time (Hu, 2023) and is being used by businesses and consumers alike for a myriad of mostly textual tasks. One reason for its unprecedented popularity is that ChatGPT, through its scale and via RLHF, has shown impressive abilities in many areas of NLP as well as emergent abilities. Another reason is that its dialog interface allows users to interact with the underlying LLM more effectively and efficiently via interactive chats that are akin to multi-turn prompting.
Discussion / Conclusion. Multitask, Multilingual, Multimodal. ChatGPT outperforms SOTA LLMs in a zero-shot manner on various tasks and even surpasses fine-tuned models on some tasks. However, there are still some failure cases (§2.1), and it sometimes produces responses with altered nuance and meaning. Dealing with these special cases is therefore a complex but important task. In terms of multilinguality, ChatGPT achieves strong performance in many high-resource and medium-resource languages. Nevertheless, ChatGPT still lacks the ability to understand and generate sentences in low-resource languages (§2.2), a finding that is also supported by Lai et al. (2023). Additionally, ChatGPT struggles to generate non-Latin script languages (§2.2.2), even when those languages are high-resource. These findings raise concerns about language diversity and inclusivity in ChatGPT (Joshi et al., 2020; Aji et al., 2022). Regarding multimodality, our flag drawing experiments showed the potential of ChatGPT’s multimodal ability. An interesting research direction would be to explore this ability further in order to answer the question “can textual models like ChatGPT switch to a multimodal backbone?”