SYNTHESIS NOTE

How much of the internet is AI-generated now?

What share of newly published websites contain AI-generated or AI-assisted content, and what measurable changes does this cause across semantic diversity, sentiment, accuracy, and style?

Synthesis note · 2026-04-18 · sourced from Social Theory Society

A representative sample of websites from the Internet Archive (2022-2025) measured with a state-of-the-art AI text detector finds that "roughly 35% of newly published websites were classified as AI-generated or AI-assisted" by mid-2025, up from zero before ChatGPT's launch in late 2022. This is the first large-scale empirical baseline for a phenomenon previously discussed only through anecdote and speculation (the "Dead Internet Theory").

What the data shows:

Semantic diversity correlates negatively with AI text prevalence — ideas converge as AI content grows
Positive sentiment correlates positively with AI text prevalence — the internet gets more upbeat
Factual accuracy shows no statistically significant change
Stylistic diversity shows no statistically significant change

The perception gap. A user study found that the majority of US adults believe all four hypotheses (reduced semantic diversity, increased positive sentiment, decreased factual accuracy, decreased stylistic diversity). People who do not use AI or use it infrequently believe in the negative impacts more; those who hold negative views of AI believe more strongly in the hypotheses. The perception of harm exceeds the measured harm on two of four dimensions — but is validated on the other two. Public fear is neither paranoia nor prophecy; it is half right.

The semantic diversity finding is the key result. Stylistic diversity is preserved (the words vary), but semantic diversity declines. This mirrors the pattern from Do different AI models actually produce diverse outputs?: surface variation masks idea convergence. The internet is saying the same things in different ways.

Connection to model collapse. In light of Does training on AI-generated content permanently degrade model quality?, the 35% AI-content baseline sets the starting condition for recursive degradation. If future models train on web crawls that are already roughly one-third AI-generated, the loss of the distribution's tails accelerates. The semantic diversity decline measured here may be the early empirical signal of model collapse manifesting in the wild, not in lab experiments.
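
The tail-loss mechanism can be illustrated with a toy simulation. This is a sketch of the extreme case, assuming each generation is trained only on samples from the previous one (the web described above is a mix of human and AI text, so real dynamics would be slower). The function name, category weights, and sample sizes are hypothetical and not taken from either note.

```python
import random
from collections import Counter


def recursive_resample(weights, generations, sample_size, seed=0):
    """Count the categories that survive each round of resampling.

    Each generation draws sample_size items from the current distribution,
    then uses the observed counts as the next generation's distribution.
    A category that draws zero samples has zero weight from then on.
    """
    rng = random.Random(seed)
    categories = list(range(len(weights)))
    current = list(weights)
    support = []
    for _ in range(generations):
        sample = rng.choices(categories, weights=current, k=sample_size)
        counts = Counter(sample)
        current = [counts.get(c, 0) for c in categories]
        support.append(sum(1 for w in current if w > 0))
    return support


# Hypothetical long-tailed distribution: two common ideas, several rare ones.
toy_weights = [50, 30, 10, 5, 3, 1, 1]
surviving = recursive_resample(toy_weights, generations=30, sample_size=50)
```

In this toy setup the number of surviving categories can only stay flat or shrink, because a category that is missed once is never sampled again. Rare categories are the most likely to be missed, which is the sense in which the tail goes first.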

The positive sentiment bias confirms what the homogeneity research predicts: AI output defaults to agreeable, constructive, and upbeat framing. In light of Does AI homogenize culture the way mass media did?, the sentiment shift represents the AI culture industry's affective signature: systematically positive, systematically inoffensive, systematically unremarkable.

The factual accuracy non-finding is surprising given hallucination concerns but may reflect selection effects: AI-generated websites that contain obvious factual errors may be less likely to persist in the archive, or factual accuracy may be domain-dependent in ways the aggregate measure misses.

Inquiring lines that read this note 5

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

How does AI-generated content transformation affect public discourse quality?

Related concepts in this collection 5

Do different AI models actually produce diverse outputs? Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
semantic convergence despite stylistic variety; the mechanism behind declining semantic diversity
Does training on AI-generated content permanently degrade model quality? When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
35% AI content is the baseline for recursive degradation
Does AI homogenize culture the way mass media did? If AI generates contextually unique outputs, how can its underlying form be homogeneous? This explores whether AI repeats the culture industry's pattern of suppressing novelty under the guise of variety.
positive sentiment bias as affective signature of the AI culture industry
Can humans detect AI text if machines can measure it? AI-generated text shows measurable differences from human writing across multiple linguistic dimensions, yet human judges consistently fail to identify it. Why does the gap between what is measurable and what is perceptible exist?
the detection gap: text is statistically distinguishable but pragmatically indistinguishable
Why do fake news detectors flag AI-generated truthful content? Fake news detectors may systematically misclassify LLM-generated text as deceptive. We explore whether this bias stems from detecting AI style rather than actual falsehood, and what that means for detection accuracy.
AI detection as proxy for style detection, not truth detection
