SYNTHESIS NOTE

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Synthesis note · 2026-02-21 · sourced from Discourses

The lexical diversity study compared ChatGPT-3.5, 4, o4-mini, and 4.5. The key finding: the newer models, o4-mini and 4.5, differ most from human-written text on lexical diversity measures. By these measures, they are the least human-like of the four.
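To make "lexical diversity measures" concrete, here is a minimal sketch of two common ones: the type-token ratio (TTR) and a moving-average variant (MATTR) that is less sensitive to text length. The note does not say which measures or tokenization the study used, so the regex tokenizer, the function names, and the window size are all illustrative assumptions.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the study's tokenizer is not stated)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens. Tends to fall as texts get longer."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Moving-average TTR: the mean TTR over every window of a fixed size.

    Falls back to plain TTR when the text is no longer than one window.
    The default window of 50 is a hypothetical choice for illustration.
    """
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Because plain TTR depends on length, a fixed-window measure such as MATTR is the safer choice when the texts being compared differ in token count, as the model outputs here do.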

At the same time, human judges consistently fail to detect AI-generated text regardless of model version. More capable models don't become easier to detect; the failure of human judgment is stable across model generations.

ChatGPT-4.5 produces higher lexical diversity than older models despite generating fewer tokens: it is more lexically dense, but the density pattern is still non-human. The implication: newer models are not converging on human-like writing by getting better at mimicking human lexical patterns; they are getting better at generating high-quality text that is nonetheless systematically different from human text.

This suggests that the training objective (RLHF, quality preference) is pushing models toward a different optimum than "human-like lexical diversity." The optimum that models converge on is rated higher in quality by human raters, yet it is measurably more distinct from how humans naturally write.

The widening gap between what is measurable and what is perceptible has an important practical consequence: as models improve, naive human-based detection becomes less viable, not more. Detection has to move to statistical or computational analysis that humans do not spontaneously perform.
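As a rough illustration of what such statistical analysis could look like, the sketch below scores a text's lexical diversity against a reference sample of human-written texts and flags it when it falls far outside the human range. This is not the study's method. The function names, the z-score approach, and the threshold of 2.0 are assumptions made for illustration, and the caller is expected to supply diversity scores computed elsewhere.

```python
from statistics import mean, stdev


def diversity_z_score(candidate_score, human_scores):
    """Number of standard deviations between the candidate and the human mean."""
    mu = mean(human_scores)
    sigma = stdev(human_scores)  # sample standard deviation; needs >= 2 scores
    if sigma == 0:
        raise ValueError("human reference scores have no variance")
    return (candidate_score - mu) / sigma


def flag_non_human(candidate_score, human_scores, threshold=2.0):
    """Flag a text whose diversity is unusually far from the human sample.

    The threshold is a hypothetical default, not a validated cut-off.
    """
    return abs(diversity_z_score(candidate_score, human_scores)) > threshold
```

A single lexical measure is a weak signal on its own; the point of the sketch is only that the comparison is mechanical and quantitative, which is exactly what unaided human readers do not do.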

Inquiring lines that read this note 17

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Can AI-generated outputs constitute genuine knowledge or valid claims?

Why do different AI models generate similar outputs independently?

Does conversational format create illusions of genuine AI communication?

What makes AI posts less likely to invite replies than human-written content?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

Does AI text rewriting systematically distort writer intent and preference?

Do language models learn genuine linguistic structure or just surface patterns?

When does optimizing for quality undermine the value of diversity?

Why do preference-tuned models produce different diversity patterns in code versus creative writing?

Original note title

newer llm generations diverge further from human lexical patterns while becoming harder to detect
