SYNTHESIS NOTE

How much of the internet is AI-generated now?

What share of newly published websites contain AI-generated or AI-assisted content, and what measurable changes does this cause across semantic diversity, sentiment, accuracy, and style?

Synthesis note · 2026-04-18 · sourced from Social Theory Society

An analysis of a representative sample of websites from the Internet Archive (2022-2025), classified with a state-of-the-art AI text detector, finds that "roughly 35% of newly published websites were classified as AI-generated or AI-assisted" by mid-2025, up from zero before ChatGPT's launch in late 2022. This is the first large-scale empirical baseline for a phenomenon previously discussed only through anecdote and speculation (the "Dead Internet Theory").

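What the data shows: semantic diversity declines and positive sentiment rises, while factual accuracy and stylistic diversity are unaffected.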

The perception gap. A user study found that the majority of US adults believe all four hypotheses (reduced semantic diversity, increased positive sentiment, decreased factual accuracy, decreased stylistic diversity). People who do not use AI or use it infrequently believe in the negative impacts more; those who hold negative views of AI believe more strongly in the hypotheses. The perception of harm exceeds the measured harm on two of four dimensions — but is validated on the other two. Public fear is neither paranoia nor prophecy; it is half right.

The semantic diversity finding is the key result. Stylistic diversity is preserved (the words vary), but semantic diversity declines. This mirrors the pattern from "Do different AI models actually produce diverse outputs?": surface variation masks idea convergence. The internet is saying the same things in different ways.
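The distinction between the two kinds of diversity can be made concrete with a toy calculation. This is a minimal sketch, not the study's method: the three texts are invented, the embedding vectors are hand-assigned stand-ins for what a sentence-embedding model might return, and Jaccard distance over word sets is only a crude proxy for stylistic variation.

```python
from itertools import combinations
from math import sqrt

def mean_pairwise(items, distance):
    """Average a distance function over every unordered pair of items."""
    pairs = list(combinations(items, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over lowercase word sets (surface-wording proxy)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return 1 - len(words_a & words_b) / len(words_a | words_b)

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return 1 - dot / (norm_u * norm_v)

# Hypothetical pages: three different wordings of the same idea.
texts = [
    "remote work boosts productivity for most teams",
    "distributed staff tend to get more done overall",
    "employees working from home usually accomplish more",
]

# Hypothetical embeddings, hand-assigned to sit close together because the
# three texts express one idea. A real pipeline would compute these.
embeddings = [
    (0.90, 0.10, 0.05),
    (0.88, 0.12, 0.07),
    (0.91, 0.09, 0.04),
]

stylistic_diversity = mean_pairwise(texts, jaccard_distance)
semantic_diversity = mean_pairwise(embeddings, cosine_distance)

print(f"stylistic (word-set) diversity: {stylistic_diversity:.3f}")
print(f"semantic (embedding) diversity: {semantic_diversity:.4f}")
```

On these invented inputs the word-level score is high while the embedding-level score is close to zero, which is the shape of the pattern described above: varied wording, converging ideas.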

Connection to model collapse. In light of "Does training on AI-generated content permanently degrade model quality?", the 35% AI content baseline establishes the starting condition for recursive degradation. If future models train on web crawls that are already one-third AI-generated, the loss of the distribution's tails accelerates. The semantic diversity decline measured here may be the early empirical signal of model collapse manifesting in the wild rather than in lab experiments.
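The recursive mechanism can be illustrated with a deterministic toy model. This is a sketch under loudly stated assumptions, none of which come from the study: "ideas" follow a Zipf-like human distribution, a model over-represents common ideas by raising probabilities to a power (the `gamma` parameter is hypothetical), and each new training corpus mixes fresh human text with the previous model's output at a fixed `ai_share`.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def sharpen(dist, gamma=1.5):
    """Assumed model behaviour: favour common ideas by raising p to gamma > 1."""
    return normalize([p ** gamma for p in dist])

def tail_mass(dist, head=10):
    """Probability mass on everything outside the `head` most common ideas."""
    return sum(dist[head:])

def simulate(ai_share, generations=5, n_ideas=50):
    """Track tail mass of the training corpus across model generations."""
    # Assumed human baseline: Zipf-like, idea at rank r has weight 1 / r.
    human = normalize([1 / (rank + 1) for rank in range(n_ideas)])
    corpus = human
    history = [tail_mass(corpus)]
    for _ in range(generations):
        model_output = sharpen(corpus)  # model trained on the current corpus
        # Next crawl: fresh human text mixed with the model's output.
        corpus = [
            (1 - ai_share) * h + ai_share * m
            for h, m in zip(human, model_output)
        ]
        history.append(tail_mass(corpus))
    return history

baseline = simulate(ai_share=0.0)         # no synthetic content
one_third = simulate(ai_share=0.35)       # roughly the share reported above
fully_synthetic = simulate(ai_share=1.0)  # models trained only on model output

for name, run in [("0%", baseline), ("35%", one_third), ("100%", fully_synthetic)]:
    print(name, [round(t, 3) for t in run])
```

Under these assumptions, the tail mass stays below the human baseline once a third of the corpus is synthetic and shrinks toward zero when the corpus is entirely synthetic. The toy model says nothing about how quickly real models degrade; it only shows why the AI share of the training crawl is the relevant starting condition.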

The positive sentiment bias confirms what the homogeneity research predicts: AI output defaults to agreeable, constructive, and upbeat framing. Read alongside "Does AI homogenize culture the way mass media did?", the sentiment shift represents the AI culture industry's affective signature: systematically positive, systematically inoffensive, systematically unremarkable.

The factual accuracy non-finding is surprising given hallucination concerns but may reflect selection effects: AI-generated websites that contain obvious factual errors may be less likely to persist in the archive, or factual accuracy may be domain-dependent in ways the aggregate measure misses.

Original note title

35 percent of new websites are AI-generated by mid-2025 — semantic diversity declines and positive sentiment rises but factual accuracy and stylistic diversity are unaffected