The Impact of AI-Generated Text on the Internet
The proliferation of AI-generated and AI-assisted text on the internet is feared to degrade semantic and stylistic diversity and factual accuracy, among other negative developments (sometimes subsumed under the “Dead Internet Theory”). Testing these hypotheses has been hindered by a lack of understanding of how much of the internet is actually AI-generated or AI-edited. To address this, we construct a representative sample of websites published on the internet between 2022 and 2025 using the Internet Archive and apply a state-of-the-art AI text detector to them. We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT’s launch in late 2022. We also find statistically significant evidence for some of the identified hypotheses; for example, increases in AI-generated text on the internet correlate negatively with semantic diversity and positively with the prevalence of positive sentiment. We do not, however, find statistically significant evidence that an increased rate of AI-generated text on the internet decreases factual accuracy or stylistic diversity.
Introduction. Ever since ChatGPT made large language models (LLMs) available to the wider public in 2022 and mass adoption followed, there have been concerns about the impact of AI-generated text (as well as AI-generated content in other modalities) on the internet and online discourse (Ferrara, 2026; Muzumdar et al., 2025). Specifically, many known limitations and failure modes of LLMs, including factual hallucinations (Huang et al., 2025), sycophancy (Malmqvist, 2025), and verbosity (Saito et al., 2023), have raised concerns that unchecked proliferation of such content could reduce the overall quality of internet content (Shumailov et al., 2024; Xing et al., 2025). These hypotheses are sometimes subsumed under the “Dead Internet Theory,” which they loosely extend but which itself predates the widespread use of LLMs (Muzumdar et al., 2025). They have been difficult to verify, primarily because there is limited understanding of how much internet content is actually AI-generated (Santy et al., 2025; Spennemann, 2025).
Discussion / Conclusion. Our study shows a shift in the composition of the open web, estimating that by mid-2025 as much as 35% of newly published websites were AI-generated or AI-assisted. Notably, we find a divergence between the impacts of this shift on online discourse and the public perception of this phenomenon. While our survey (RQ1) reveals public concern about systemic truth decay (Hyp. 2) and stylistic homogenization (Hyp. 6) as a result of AI-generated text proliferation, our web-scale analysis (RQ3) does not yield statistically significant evidence of macro-level degradation in factual accuracy or of a strict stylistic monoculture. This divergence suggests that the immediate threat to online discourse may be epistemic rather than purely factual in nature. As AI-generated text becomes ubiquitous and indistin- Rather than an explosion of falsehoods, the footprint of AI proliferation on the internet manifests primarily as semantic contraction (Hyp.