The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing
These names do not exist. Elena Vasquez and Marcus Chen have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents, never having lived. We show that large language models do not merely default to high-probability individual names when generating fictional experts; they produce correlated character ensembles: pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale.
Introduction. The proliferation of LLM-generated content on the web has raised urgent questions about content provenance and authenticity. Prior work has focused on stylometric detection and watermarking at the token level (Kirchenbauer et al., 2023). We identify a complementary signal that requires no model access and leaves no intentional mark: the name prior. When prompted to generate fictional experts, researchers, or protagonists without explicit name instructions, large language models default to a small set of high-probability names. We show that these defaults are correlated (models generate preferred character ensembles, not independent draws) and model-version-specific, shifting at release boundaries. Because enormous volumes of web content are generated using LLMs without overriding these defaults, the characteristic name ensembles of each model version become embedded in the content it produces. The web is an unintentional archive of LLM behavioral fingerprints. The consequences extend beyond the open web.
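To make "co-occurrence rates far exceed chance" concrete, the following is a minimal sketch of one way such a comparison could be computed; it is not the procedure used in this study. It assumes each generation is available as a plain string and that a candidate name list is known in advance. The function name `cooccurrence_lift` and the use of lift (observed joint count divided by the count expected if names were drawn independently) are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_lift(docs, names):
    """Return {(name_a, name_b): lift} for every pair seen together.

    lift = observed joint count / joint count expected under independence.
    A lift well above 1 suggests the pair is generated as an ensemble
    rather than as two independent draws.
    """
    n_docs = len(docs)
    single = Counter()  # number of documents containing each name
    pair = Counter()    # number of documents containing both names of a pair
    for text in docs:
        # Simple substring matching; a real pipeline would need normalization.
        present = sorted(name for name in names if name in text)
        single.update(present)
        pair.update(combinations(present, 2))

    lifts = {}
    for (a, b), joint in pair.items():
        expected = single[a] * single[b] / n_docs
        lifts[(a, b)] = joint / expected
    return lifts


if __name__ == "__main__":
    # Toy documents, invented for illustration only.
    toy_docs = [
        "Dr. Elena Vasquez and Marcus Chen study volcanoes.",
        "Elena Vasquez interviews Marcus Chen on the podcast.",
        "Aris Thorne leads the expedition.",
        "No named experts appear here.",
    ]
    toy_names = ["Elena Vasquez", "Marcus Chen", "Aris Thorne"]
    for names_pair, lift in cooccurrence_lift(toy_docs, toy_names).items():
        print(names_pair, round(lift, 2))
```

Lift is only one possible statistic; with small samples it is noisy, so a significance test over the same counts would be a natural complement.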
Discussion / Conclusion. We have shown that LLMs generate correlated character ensembles, not merely high-probability individual names, that are model-family-specific, version-specific, and actively suppressed at release boundaries; the suppression is itself evidence that the priors were strong enough to be noticed. These ghost names propagate from model outputs into AI-generated web content and from there into academic publishing infrastructure. On Zenodo alone, 1,655 ghost-authored records with real DataCite DOIs were registered in a 60-day automated burst, claiming nonexistent journals with backdated publication dates; the infrastructure for large-scale scholarly record contamination is already in place. The academic record is being quietly haunted.
Our probing study covers only publicly accessible API checkpoints; internal or fine-tuned models are not covered. Prompt set size (30 prompts per condition) is sufficient to establish dominant priors but may miss lower-frequency names. Web corpus collection via Google Search (Serper) is subject to recency bias in the age field; page-level publication dates from slop sites are unreliable. ResearchGate paper dates are more trustworthy but require systematic collection at scale, which is ongoing.
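The prompt-set-size limitation can be illustrated with a short calculation. This is a sketch under a simplifying assumption that is not taken from the study: each of the n prompts is treated as an independent draw in which a given name appears with fixed probability p. Under that assumption the chance of never observing the name is (1 - p)^n. The function names below are hypothetical.

```python
def miss_probability(p, n=30):
    """Probability that a name with per-generation frequency p never
    appears in n independent generations (simplifying assumption)."""
    return (1.0 - p) ** n


def min_detectable_frequency(n=30, confidence=0.95):
    """Smallest per-generation frequency that would be observed at least
    once in n independent generations with the given confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        print(f"p={p}: miss probability over 30 prompts = {miss_probability(p):.3f}")
    print(f"95% detectable frequency at n=30: {min_detectable_frequency():.3f}")
```

Under this independence assumption, a name that appears in about 10% of generations is missed by a 30-prompt set only around 4% of the time, while a name at 1% is missed roughly three times in four, which is consistent with the statement that dominant priors are established but lower-frequency names may be missed.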