The Impact of AI-Generated Text on the Internet


The proliferation of AI-generated and AI-assisted text on the internet is feared to degrade semantic and stylistic diversity and factual accuracy, and to contribute to other negative developments (sometimes subsumed under the “Dead Internet Theory”). Testing these fears has been hindered by a lack of understanding of how much of the internet is actually AI-generated or AI-edited. To this end, we construct a representative sample of websites published on the internet between 2022 and 2025 using the Internet Archive, and apply a state-of-the-art AI text detector to them. We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT’s launch in late 2022. We also find statistically significant evidence for some of the identified hypotheses; for example, increases in AI-generated text on the internet correlate negatively with semantic diversity and positively with the prevalence of positive sentiment. We do not, however, find statistically significant evidence that an increased rate of AI-generated text on the internet decreases factual accuracy or stylistic diversity.
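As a rough illustration of the kind of aggregate analysis the abstract describes, the following Python sketch computes the per-month share of pages flagged by a detector and the Pearson correlation between that share and a diversity score. It is not the authors' pipeline. The function names (`monthly_ai_share`, `pearson`), the input layout, and every toy value are assumptions made for illustration; the detector is treated as a black box that has already produced labels, and no significance test is performed.

```python
from collections import defaultdict
from math import sqrt


def monthly_ai_share(pages):
    """Return {month: fraction of pages flagged as AI} for (month, is_ai) pairs."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for month, is_ai in pages:
        totals[month] += 1
        flagged[month] += int(is_ai)
    return {month: flagged[month] / totals[month] for month in sorted(totals)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy placeholder input, NOT data from the paper: (month, detector_label) pairs.
toy_pages = [
    ("2022-06", False), ("2022-06", False), ("2022-06", False), ("2022-06", False),
    ("2023-06", True), ("2023-06", False), ("2023-06", False), ("2023-06", False),
    ("2024-06", True), ("2024-06", True), ("2024-06", False), ("2024-06", False),
]
# Hypothetical per-month semantic diversity scores, also made up.
toy_diversity = {"2022-06": 0.9, "2023-06": 0.8, "2024-06": 0.7}

shares = monthly_ai_share(toy_pages)
months = sorted(shares)
r = pearson([shares[m] for m in months], [toy_diversity[m] for m in months])
print(shares, r)
```

A negative `r` on real data would be consistent with the reported negative association between AI-generated text and semantic diversity, but establishing statistical significance would additionally require a hypothesis test and a much larger sample than this toy input.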

Introduction. Ever since ChatGPT made large language models (LLMs) available to the wider public in 2022 and mass adoption followed, there have been concerns about the impact of AI-generated text (as well as AI-generated content in other modalities) on the internet and online discourse (Ferrara, 2026; Muzumdar et al., 2025). Specifically, many known limitations and failure modes of LLMs, including factual hallucinations (Huang et al., 2025), sycophancy (Malmqvist, 2025), and verbosity (Saito et al., 2023), have raised concerns that unchecked proliferation of such content could reduce the overall quality of internet content (Shumailov et al., 2024; Xing et al., 2025). These hypotheses are sometimes subsumed under the “Dead Internet Theory,” which they loosely expand but which itself predates the widespread use of LLMs (Muzumdar et al., 2025). They have been difficult to verify, primarily because there is limited understanding of how much internet content is actually AI-generated (Santy et al., 2025; Spennemann, 2025).

Discussion / Conclusion. Our study shows a shift in the composition of the open web, estimating that as much as 35% of newly published websites by mid-2025 were AI-generated or AI-assisted. Notably, we find a divergence between the impact of this shift on online discourse and the public perception of it. While our survey (RQ1) reveals public concern about systemic truth decay (Hyp. 2) and stylistic homogenization (Hyp. 6) as a result of AI-generated text proliferation, our web-scale analysis (RQ3) does not yield statistically significant evidence of macro-level degradation in factual accuracy or of a strict stylistic monoculture. This divergence suggests that the immediate threat to online discourse may be of an epistemic nature rather than purely factual. As AI-generated text becomes ubiquitous and indistin[…] Rather than an explosion of falsehoods, the footprint of AI proliferation on the internet manifests primarily as semantic contraction (Hyp.

Lines of inquiry this paper opens 24

Research framings built by reading the notes related to this paper — the questions it feeds into.

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

How does AI-generated content transformation affect public discourse quality?

Does AI text rewriting systematically distort writer intent and preference?

Can AI-generated outputs constitute genuine knowledge or valid claims?

What genuine cultural forms does AI homogeneity actually displace?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

What interpretive work must humans perform to experience AI as a conversation partner?

Does tokenized intelligence retain genuine value through exchange-based systems?

Why do print-era intuitions about commodities fail for AI outputs?

How should human oversight be integrated with autonomous AI systems?

Can humans develop oversight strategies that work across all GenAI rhetorical shifts?

What makes AI persuasion effective and how can we counter it?

Can readers distinguish between AI and human persuasion on textual surface alone?

Does conversational format create illusions of genuine AI communication?

Can audiences learn to recognize and resist moralized AI rhetoric?


