Retrieval Collapses When AI Pollutes the Web
The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process where (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyzed this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state in which answer accuracy remained stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines like BM25 exposed ∼19% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.
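The distinction between pool contamination and exposure contamination can be made concrete with a small sketch. The definitions used here are assumptions, since this excerpt does not spell them out: pool contamination is taken as the share of synthetic documents in the candidate pool, and exposure contamination as the mean share of synthetic documents among the top-k results per query. The function names and the toy data are hypothetical and are not the paper's experimental values.

```python
from typing import Sequence


def pool_contamination(is_synthetic: Sequence[bool]) -> float:
    """Share of synthetic documents in the candidate pool (assumed definition)."""
    if not is_synthetic:
        raise ValueError("pool must not be empty")
    return sum(is_synthetic) / len(is_synthetic)


def exposure_contamination(rankings: Sequence[Sequence[bool]], k: int) -> float:
    """Mean share of synthetic documents in the top-k results, averaged over queries.

    Each inner sequence holds the synthetic flags of one query's results,
    in ranked order (assumed definition).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not rankings:
        raise ValueError("need at least one ranking")
    shares = []
    for ranked_flags in rankings:
        top_k = ranked_flags[:k]
        if not top_k:
            raise ValueError("each ranking must contain at least one result")
        shares.append(sum(top_k) / len(top_k))
    return sum(shares) / len(shares)


# Toy data: half of the pool is synthetic.
pool_flags = [True, True, False, False]

# Two queries over the same pool; synthetic documents tend to rank higher.
toy_rankings = [
    [True, True, False, False],
    [True, False, True, False],
]

pool_share = pool_contamination(pool_flags)
exposure_share = exposure_contamination(toy_rankings, k=2)
```

In this toy setup the exposure share exceeds the pool share, which is the kind of amplification the SEO scenario reports at larger scale.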
Introduction. The rapid proliferation of Large Language Models (LLMs) has fundamentally transformed the landscape of Web content creation [16]. Although this shift offers scalability in information production, it introduces a critical structural vulnerability for search engines and Retrieval-Augmented Generation (RAG) systems [9, 12]. These systems increasingly consume evidence that is itself generated by the very models they rely on, creating a self-referential cycle. While similar phenomena have been studied in model training as model collapse [1, 15], the implications for the retrieval ecosystem remain underexplored. We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage degradation process. The first stage, Dominance and Homogenization, occurs when high-quality, SEO-optimized synthetic content captures the top search results, drastically reducing source diversity.
Discussion / Conclusion. We formally introduced and empirically validated Retrieval Collapse, a two-stage structural issue where synthetic content first achieves Dominance and subsequently facilitates System Corruption. Our Scenario 1 findings expose a critical loss in source diversity, introducing extreme brittleness where high accuracy masks ecosystem decay. Scenario 2 demonstrates that scalable baselines like BM25 are critically vulnerable to adversarial pollution (∼19% exposure), whereas LLM-based rankers offer resilience but at high computational cost. By establishing the framework of Retrieval Collapse, this work lays the foundation for understanding how synthetic content reshapes information retrieval. To mitigate these risks, we propose a shift toward Defensive Ranking strategies that jointly optimize relevance, factuality, and provenance [10].
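One way to picture a ranking that weighs relevance, factuality, and provenance together is a weighted sum of three per-document scores. This is only an illustrative sketch, not the Defensive Ranking method itself: the text names the three signals but does not say how they are combined, so the linear form, the weights, the `Candidate` type, and the assumption that each score lies in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    relevance: float   # assumed: normalized retriever score in [0, 1]
    factuality: float  # assumed: score from some fact-checking signal in [0, 1]
    provenance: float  # assumed: source-trust score in [0, 1]


def defensive_score(
    c: Candidate,
    w_rel: float = 0.5,
    w_fact: float = 0.3,
    w_prov: float = 0.2,
) -> float:
    """Weighted sum of the three signals (hypothetical weights)."""
    return w_rel * c.relevance + w_fact * c.factuality + w_prov * c.provenance


def defensive_rank(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Sort candidates by the combined score, highest first."""
    return sorted(candidates, key=defensive_score, reverse=True)


# A highly relevant but poorly sourced document versus a slightly less
# relevant document with strong factuality and provenance scores.
seo_like = Candidate("seo_like", relevance=0.9, factuality=0.2, provenance=0.1)
well_sourced = Candidate("well_sourced", relevance=0.7, factuality=0.9, provenance=0.8)

ranked_ids = [c.doc_id for c in defensive_rank([seo_like, well_sourced])]
```

With these hypothetical weights the well-sourced document is ranked first even though its relevance score is lower; a relevance-only ranker would order the two the other way.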