The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing

Michał Brzozowski, Neo Christopher Chung · arXiv:2606.02184

The discovery of these correlated phantom authors bears on a deeper question about what LLMs actually encode: models do not just converge on plausible outputs, they converge on predictable co-occurrence patterns, leaving signatures as legible as fingerprints. Ghost names are more than a curiosity because they have begun polluting real scholarly infrastructure: systems designed to catalog genuine knowledge now index synthetic ensembles alongside real research, blurring the boundary between fabrication and documentation. The deeper tension is that LLM outputs represent subjective belief distributions learned from training data, not objective observation, and once those outputs leak into citation graphs and DOI registries, signal becomes hard to separate from artifact. How should scholarly gatekeeping systems be designed when generative models encode model-family-specific hallucinations that downstream tools cannot distinguish from legitimate publication metadata?

Abstract

These names do not exist. Elena Vasquez and Marcus Chen have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents, never having lived. We show that large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, we identify 1,655 ghost-authored records claiming nonexistent journals with fabricated publication dates: server-side DataCite timestamps prove deliberate backdating, and 991 records were registered in a single month; these carry real DOIs registered in DataCite, making them harvestable by any scholarly aggregator that ingests DOI metadata. Ghost names additionally appear on ResearchGate forming synthetic research groups with collaborators drawn from multiple model families; publication dates on these records provide a reliable temporal proxy for model deployment windows.
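The claim that co-occurrence rates "far exceed chance" can be made concrete with a lift ratio: the observed rate at which two names share a document, divided by the rate expected if each name appeared independently. The sketch below is illustrative only and is not the authors' method. The function name `pair_lift` and the four-document toy corpus are assumptions invented for this example; only the names themselves come from the abstract.

```python
def pair_lift(documents, name_a, name_b):
    """Return observed / expected co-occurrence rate for two names.

    `documents` is a list of sets, each holding the names found in one
    document. A lift near 1.0 is what independence would predict; values
    well above 1.0 mean the names travel together. Returns None when
    either name never appears, since the ratio is then undefined.
    """
    n = len(documents)
    if n == 0:
        return None
    count_a = sum(1 for names in documents if name_a in names)
    count_b = sum(1 for names in documents if name_b in names)
    count_ab = sum(
        1 for names in documents if name_a in names and name_b in names
    )
    if count_a == 0 or count_b == 0:
        return None
    expected_rate = (count_a / n) * (count_b / n)
    observed_rate = count_ab / n
    return observed_rate / expected_rate


# Toy corpus, invented for illustration (not data from the paper).
toy_documents = [
    {"Elena Vasquez", "Marcus Chen"},
    {"Elena Vasquez", "Marcus Chen", "Amara Okafor"},
    {"Aris Thorne", "Lena Petrova"},
    {"Elara Voss"},
]

print(pair_lift(toy_documents, "Elena Vasquez", "Marcus Chen"))
print(pair_lift(toy_documents, "Elena Vasquez", "Elara Voss"))
```

A raw lift ratio is unstable for rare names, so a real analysis would pair it with a significance test and a much larger corpus.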
