Fake News Detectors are Biased against Texts Generated by Large Language Models

Paper · arXiv 2309.08674 · Published September 15, 2023
Sentiment, Semantics, and Toxicity Detection

The spread of fake news has emerged as a critical challenge, undermining trust and posing threats to society. In the era of Large Language Models (LLMs), the capability to generate believable fake content has intensified these concerns. In this study, we present a novel paradigm to evaluate fake news detectors in scenarios involving both human-written and LLM-generated misinformation. Intriguingly, our findings reveal a significant bias in many existing detectors: they are more prone to flagging LLM-generated content as fake news while often misclassifying human-written fake news as genuine. This unexpected bias appears to arise from distinct linguistic patterns inherent to LLM outputs. To address this, we introduce a mitigation strategy that leverages adversarial training with LLM-paraphrased genuine news. The resulting model yielded marked improvements in detection accuracy for both human and LLM-generated news. To further catalyze research in this domain, we release two comprehensive datasets, GossipCop++ and PolitiFact++, which combine human-validated articles with LLM-generated fake and real news.

Introduction. The dissemination of false information can cause chaos, hatred, and trust issues, and can eventually hinder the development of society as a whole (Wasserman and Madrid-Morales, 2019). Among the forms of false information, fake news is often used to manipulate certain populations and has had a catastrophic impact on multiple events, such as Brexit (Bastos and Mercea, 2019), the COVID-19 pandemic (van Der Linden et al., 2020), and the 2022 Russian assault on Ukraine (Mbah and Wasum, 2022). To spread such fake news, adversaries have conventionally deployed propaganda techniques and written the articles by hand (Huang et al., 2022). Creating convincing disinformation manually is labor-intensive and time-consuming, which may limit the scale and speed at which such content can be produced. This makes manual writing less efficient and less desirable for adversaries who aim for widespread and rapid dissemination of false information (Zellers et al., 2019).

Discussion / Conclusion. In this study, we introduced a novel paradigm for fake news detection, factoring in both human-written and LLM-generated news articles. Our investigations uncovered an unexpected bias: detectors frequently misclassify truthful LLM outputs as fake. Delving deeper, we identified potential linguistic ‘shortcuts’ these detectors take. Our mitigation strategy, founded on adversarial training with LLM-paraphrased real news, effectively reduced this bias. We further contributed by offering two enriched datasets, GossipCop++ and PolitiFact++, enhancing the scope for future research in this domain.
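
One way to make the bias described above concrete is to report how often a detector flags articles as fake separately for each combination of source (human or LLM) and veracity (real or fake). The following is a minimal sketch under stated assumptions, not the evaluation code used in the study: the function name `flag_rate_by_group`, the record fields `text`, `source`, and `label`, and the toy detector are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def flag_rate_by_group(examples, predict_fake):
    """Return the fraction of articles flagged as fake per (source, label) group.

    `examples` is a list of dicts with hypothetical keys "text", "source"
    ("human" or "llm"), and "label" ("real" or "fake").
    `predict_fake` is any callable mapping a text to True (fake) or False.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        key = (ex["source"], ex["label"])
        total[key] += 1
        if predict_fake(ex["text"]):
            flagged[key] += 1
    return {key: flagged[key] / total[key] for key in total}


def toy_detector(text):
    # Hypothetical stand-in for a detector that relies on a stylistic
    # shortcut: it flags any text containing the word "moreover".
    return "moreover" in text.lower()


examples = [
    {"text": "The council approved the budget on Monday.", "source": "human", "label": "real"},
    {"text": "Moreover, the council approved the budget.", "source": "llm", "label": "real"},
    {"text": "Aliens approved the budget on Monday.", "source": "human", "label": "fake"},
    {"text": "Moreover, aliens approved the budget.", "source": "llm", "label": "fake"},
]

rates = flag_rate_by_group(examples, toy_detector)
for group, rate in sorted(rates.items()):
    print(group, rate)
```

In this toy setup, a high flag rate on the `("llm", "real")` group combined with a low flag rate on the `("human", "fake")` group would correspond to the pattern reported here: the detector responds to how the text is written rather than to whether it is true.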