Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
It has become routine to report research results where Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks, and creative text writing is no exception. It seems natural, then, to raise the bid: Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top-performing LLMs), in the spirit of AI-human duels such as DeepBlue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their own titles and their opponent’s. We then prepared an evaluation rubric inspired by Boden’s definition of creativity and collected 5,400 manual assessments from literary critics and scholars. Our results indicate that LLMs are still far from challenging a top human creative writer, and that such a level of autonomous creative writing skill probably cannot be reached simply with larger language models.
Introduction. Large Language Models (LLMs) have recently shown strong competence in generating human-like text, particularly in creative writing tasks [Achiam et al., 2023], which are the focus of this paper. LLMs are increasingly influencing creative industries, with an impact on both the economy and the labor market, as highlighted by significant events such as the Hollywood screenwriters’ strike [Lee, 2022, Eloundou et al., 2023, Koblin and Barnes, 2023]. Experiments show that, under different settings, LLMs can perform better than average humans at short creative writing tasks [Marco et al., 2023, Gómez-Rodríguez and Williams, 2023]. LLMs seem to be ready, then, for the next level of experimental inquiry: Can they already compete with the best human creative writers? Note that, in the history of Artificial Intelligence (AI), symbolic landmarks have involved competition between the best AI systems and the best humans at the task, as in DeepBlue vs Kasparov [Campbell et al., 2002] and AlphaGo vs Lee Sedol [Silver et al., 2016].
Discussion / Conclusion. Are LLMs ready to challenge world-class novelists, rather than average writers, in literary writing tasks? Our study, inspired by historic duels such as DeepBlue vs Kasparov and AlphaGo vs Lee Sedol, was designed to answer this question. We set up an experiment in which both contenders, GPT-4 and Patricio Pron, received the same information about the contest, were given the same tasks, and were blindly evaluated with the same rubric by a panel of six literary experts (critics and scholars). We collected 60 titles (30 proposed by each contender), 120 literary texts (each contender wrote one text for each of the 60 titles), and 5,400 expert assessments on different quality aspects of the titles and texts produced. Our results indicate that GPT-4 Turbo, despite its impressive writing capabilities, still falls short of matching the skills of a world-class novelist.
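The crossed design described above (each contender proposes thirty titles and then writes a text for every title, own or opponent's) can be sketched in a few lines of Python. This is only an illustrative enumeration of the counts stated in the text; the identifiers (`CONTENDERS`, `title_id`, `own_title`) are hypothetical placeholders, the real titles are not reproduced, and the per-item breakdown of the 5,400 assessments is deliberately not modeled because it is not specified here.

```python
from itertools import product

# Hypothetical labels for the two contenders named in the text.
CONTENDERS = ("Pron", "GPT-4")
TITLES_PER_CONTENDER = 30  # "thirty titles each"

# Placeholder title records; only the proposer is tracked, not the title text.
titles = [
    {"title_id": f"{proposer}-{i:02d}", "proposed_by": proposer}
    for proposer in CONTENDERS
    for i in range(1, TITLES_PER_CONTENDER + 1)
]

# Every contender writes one text for every title (own and opponent's).
texts = [
    {
        "title_id": title["title_id"],
        "proposed_by": title["proposed_by"],
        "written_by": writer,
        "own_title": title["proposed_by"] == writer,
    }
    for title, writer in product(titles, CONTENDERS)
]

n_titles = len(titles)
n_texts = len(texts)
n_own_title_texts = sum(text["own_title"] for text in texts)

print(f"titles: {n_titles}, texts: {n_texts}, "
      f"texts written for the writer's own title: {n_own_title_texts}")
```

Enumerating the design this way makes the four writer-by-proposer cells explicit (Pron or GPT-4 writing for a title proposed by Pron or by GPT-4), each containing thirty texts.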