Does high-frequency text homogenize user input before generation?
Does Adam's Law reveal how LLMs flatten distinctive user voices at the parsing stage, not just in output? This matters because it could explain why model accuracy and generic responses emerge from the same mechanism.
Adam's Law surfaces a tension that earlier homogenization research could not localize. "Do different AI models actually produce diverse outputs?" documents output convergence; "How much of the internet is AI-generated now?" tracks that convergence at internet scale; "Do LLMs compress concepts more aggressively than humans do?" describes the representational mechanism. What was missing was an input-side account: how distinct user voices get flattened before the model starts generating.
Adam's Law supplies it. The model prefers high-frequency surface forms at the comprehension stage. Users iteratively rephrase their prompts in search of better results, which empirically means toward higher frequency and therefore toward the median register. Distinct prompts (a domain expert's specialized phrasing, a regional dialect, a technical idiolect) get pre-processed by the user's own paraphrasing toward whatever phrasing the model handles best, which is whatever phrasing the corpus contained most. Homogenization happens in the parsing of the request, not just in the generation of the response.
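The selection pressure described above can be illustrated with a toy sketch. It is not the measure used in the Adam's Law paper, which this note does not specify. It assumes a crude proxy (the mean smoothed log word frequency of a prompt) and a tiny invented corpus. `TOY_CORPUS`, `mean_log_frequency`, and `pick_highest_frequency` are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for pretraining text (an assumption;
# a real estimate would use counts from a large reference corpus).
TOY_CORPUS = (
    "how do i fix the error in my code "
    "how do i fix this problem "
    "what is the best way to fix the error "
    "how do i make my code run faster"
)


def build_counts(corpus: str) -> Counter:
    """Count whitespace-separated, lowercased words."""
    return Counter(corpus.lower().split())


def mean_log_frequency(text: str, counts: Counter) -> float:
    """Average add-one-smoothed log relative frequency of the words in text."""
    words = text.lower().split()
    if not words:
        raise ValueError("text must contain at least one word")
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    log_probs = [math.log((counts[w] + 1) / (total + vocab)) for w in words]
    return sum(log_probs) / len(log_probs)


def pick_highest_frequency(paraphrases, counts):
    """Return the paraphrase whose wording is most common under the counts."""
    return max(paraphrases, key=lambda p: mean_log_frequency(p, counts))


if __name__ == "__main__":
    counts = build_counts(TOY_CORPUS)
    candidates = [
        "whence cometh yon segfault in mine kernel module",  # distinctive voice
        "how do i fix the error in my code",                 # median register
    ]
    for c in candidates:
        print(f"{mean_log_frequency(c, counts):.3f}  {c}")
    print("selected:", pick_highest_frequency(candidates, counts))
```

If a user keeps whichever rewording scores best under a criterion like this, each round of rephrasing moves the prompt toward the corpus's most common wording. That is the input-side flattening the paragraph describes.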
The tension is sharp: the same property that gives LLMs their accuracy on common tasks, strong modeling of dense distributional regions, is the property that filters out distinctiveness on the input side. There is no cheap fix because the mechanism is constitutive of how the model works, not a bug in a post-processing layer. Tokenization-of-intelligence, in this frame, is tokenization toward the corpus mean; the input channel and the output channel both narrow toward the high-frequency centroid. A user with a distinctive voice who wants to use the model effectively faces an asymmetric trade: speak distinctively and lose accuracy, or speak in the model's preferred register and lose voice. The architecture provides no third option.
Inquiring lines that use this note as a source 14
This note is a source for these synthesized inquiries.
- Do AI-generated posts crowd out human voices without any coordination or intent?
- Does homogenization at the text level cause homogenization of perceived authors?
- What happens to solidarity and community signaling when AI smooths out voice differences?
- Why does statistical compression destroy literary connotation and meaning?
- How do you attribute copyright when billions of inputs shape one model?
- How does tokenization toward corpus mean affect downstream output diversity?
- What makes output convergence across models inevitable given input-side homogenization?
- Can distinctive input voices maintain accuracy without adopting the model's preferred register?
- Why do different readers extract different meanings from identical text?
- How do power-law distributions differ from uniform collision assumptions?
- What information does transcription destroy that direct speech-to-speech models preserve?
- How does uniform code distribution make items more distinguishable?
- How does entrainment between speaker and listener build mutual scaling?
- How does smooth generation lead to proliferation without new viewpoints?
Related concepts in this collection 3
- Do different AI models actually produce diverse outputs?
  Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
  output-side convergence; Adam's Law is the input-side mechanism
- How much of the internet is AI-generated now?
  What share of newly published websites contain AI-generated or AI-assisted content, and what measurable changes does this cause across semantic diversity, sentiment, accuracy, and style?
  internet-scale signal of the same dynamic
- Do LLMs compress concepts more aggressively than humans do?
  Do language models prioritize statistical compression over semantic nuance when forming conceptual representations, and how does this differ from human category formation? This matters because it may explain why LLMs fail at tasks requiring fine-grained distinctions.
  representational counterpart
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Adam's Law: Textual Frequency Law on Large Language Models
- Creativity Has Left the Chat: The Price of Debiasing Language Models
- NoveltyBench: Evaluating Language Models for Humanlike Diversity
- Measuring and Mitigating Persona Distortions from AI Writing Assistance
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
- Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
- Large Language Models Do Not Simulate Human Psychology
Original note title
high-frequency text is the homogenization channel — the same mechanism that makes LLMs accurate also makes them generic