Argument Collapse: LLMs Flatten Long-Form Public Debate
As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge on a smaller set of main arguments, sub-arguments, and paragraph-level structures. We compare 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 longer-form Boston Review (BR) forums, and 23,384 LLM-generated essays. In the NYT corpus, 65.3% of human main arguments are unique within a debate, compared to 3.4% of LLM main arguments. Asking LLMs to generate diverse answers adds variation, but a typical model recovers only about half of the distinct human main arguments, and much of the added variation falls outside the observed human argument space. Collapse also appears in sub-arguments: among essays with the same main argument, 41.0% of human sub-arguments are unique, versus 9.1% of LLM sub-arguments. Qualitatively, LLMs often reuse generalized and hedged sub-arguments, while humans prefer more concrete and topic-specific ones. Structurally, LLM-generated essays tend to follow a more fixed arc, often opening with a direct claim and moving quickly toward proposals.
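The two headline quantities in the abstract, within-debate uniqueness and recovery of human arguments, can be illustrated with a minimal sketch. It rests on assumptions not stated in the text above: each essay has already been mapped to a discrete main-argument label, an argument counts as "unique" when no other essay in the same debate shares its label, and counts are pooled across debates rather than averaged per debate. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def unique_fraction(labels_by_debate):
    """Share of arguments whose label occurs exactly once within its debate.

    Assumption: 'unique' means no other essay in the same debate has the
    same label; counts are pooled over all debates.
    """
    unique = 0
    total = 0
    for labels in labels_by_debate.values():
        counts = Counter(labels)
        unique += sum(1 for label in labels if counts[label] == 1)
        total += len(labels)
    return unique / total if total else 0.0


def human_coverage(human_labels, llm_labels):
    """Share of distinct human argument labels that also appear among LLM labels."""
    human = set(human_labels)
    if not human:
        return 0.0
    return len(human & set(llm_labels)) / len(human)


# Toy labels, invented purely for illustration (not data from the paper).
human = {"debate_1": ["a", "b", "c", "a"], "debate_2": ["d", "e"]}
llm = {"debate_1": ["a", "a", "a", "f"], "debate_2": ["d", "d"]}

human_unique = unique_fraction(human)
llm_unique = unique_fraction(llm)
coverage_debate_1 = human_coverage(human["debate_1"], llm["debate_1"])
```

In this toy setup the LLM label "f" adds variation without matching any human argument, which mirrors the case where added diversity falls outside the observed human argument space.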
Introduction. LLMs are now common aids in public-facing argumentative writing, including opinion essays and policy memos (Lee et al., 2022; Russell et al., 2026a). Such usage comes with a caveat: model suggestions can alter what claims writers make, how those claims are supported, and how much of the writer’s own voice remains present (Padmakumar and He, 2024; Doshi and Hauser, 2024; Abdulhai et al., 2026; Röttger et al., 2026a). Prior work shows that LLMs can produce generative monocultures through narrowing output distributions (Wu et al., 2025; Zhang et al., 2025b; Jiang et al., 2026; Nie et al., 2026) and reducing epistemic diversity in generated claims (Wright et al., 2025). These homogenization measures, however, often do not directly compare human and LLM output distributions under the same task conditions, leaving it unclear what is lost when people write arguments with model assistance (Jain et al., 2025).
Discussion / Conclusion. We define argument collapse as the tendency of independently generated essays to converge on the same small set of plausible arguments. Across New York Times debates and longer-form Boston Review forums, we find collapse at three levels: main arguments, supporting reasons, and argumentative structure. Model-generated essays converge on fewer main arguments, reuse supporting reasons more often even under a shared main argument, and follow a more standardized argumentative arc. The risk is not only surface idiosyncrasy or opinion bias but also a narrowing of the range of arguments readers encounter, potentially amplifying dominant arguments and limiting long-tail reasoning. Our study measures argumentative diversity rather than argument quality or factual accuracy. Even if humans generate more distinctive arguments, those arguments are not necessarily better. Thus, our results should not be interpreted as showing that human essays are always more persuasive, more accurate, or preferable to LLM-generated essays.