What linguistic features most strongly signal LLM authorship in counter-arguments?
This explores which textual fingerprints — word choice, style, argument structure — most reliably give away that a counter-argument was written by an LLM rather than a person.
The corpus has a surprisingly precise answer: the strongest signals aren't exotic or deep. They are stylistic, and many of them aren't even properties of the reply in isolation but of its *relationship* to the post it answers. One study on r/ChangeMyView found that simple, interpretable linguistic features combined with argument-quality measures reached 99% detection accuracy, matching heavyweight neural detectors while staying cheap and transparent (see: Can simple linguistic features detect AI-written arguments?). So the tell is not subtle; it is legible in the surface language.
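A minimal sketch of that kind of cheap, transparent detector follows. It assumes three hypothetical surface features (mean sentence length, type-token ratio, and a hedge/concession-word rate), not the study's actual feature set, and uses a plain logistic regression written out by hand so every weight stays inspectable. The four training texts are invented for illustration; they are not data from r/ChangeMyView.

```python
import math
import re

# Hypothetical hedge/concession lexicon; an assumption, not the study's feature set.
HEDGES = {"however", "although", "arguably", "perhaps", "while"}


def features(text: str) -> list[float]:
    """Map a text to three interpretable surface features (all hypothetical)."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(words), 1)
    return [
        len(words) / max(len(sentences), 1) / 20.0,  # mean sentence length, scaled
        len(set(words)) / n,                          # type-token ratio
        10.0 * sum(w in HEDGES for w in words) / n,   # hedge rate, scaled
    ]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x) -> float:
    """Estimated probability that a feature vector belongs to the LLM-like class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Invented toy examples (1 = LLM-like, 0 = human-like); not real study data.
corpus = [
    ("While that is a fair point, the evidence suggests otherwise. However, "
     "the long-term benefits arguably outweigh the costs.", 1),
    ("Although your concern is understandable, perhaps we should consider the "
     "broader context. This perspective, while nuanced, deserves attention.", 1),
    ("nah that's just wrong lol. prices went up. you're ignoring it and so is "
     "everyone else here honestly.", 0),
    ("this is the dumbest take i've seen all week. the numbers say the "
     "opposite. read the thread.", 0),
]
X = [features(text) for text, _ in corpus]
y = [label for _, label in corpus]
w, b = train_logreg(X, y)
for name, weight in zip(["sentence_length", "type_token_ratio", "hedge_rate"], w):
    print(f"{name}: {weight:+.2f}")
```

Because each weight attaches to a named feature, you can read off which cue is doing the work, which is the transparency advantage over a neural detector. Argument-quality measures would enter the same way, as extra columns in the feature vector.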
The single most distinctive feature may be one you'd never think to look for: how closely the reply mirrors the post it answers. LLM counter-arguments converge with the original post in style, named entities, and psycholinguistic features far more than human replies do, a side effect of autoregressive generation accommodating whatever framing it is given (see: Do LLM counter-arguments mirror writing style more than humans?). The giveaway therefore lives in the *relationship between the two texts*, not in the reply alone, and a detector that never sees the original post would miss it entirely.
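The relational point can be made concrete with a sketch that scores a reply only against the post it answers. It assumes two hypothetical, deliberately crude features: bag-of-words cosine similarity as a stand-in for stylistic alignment, and overlap of mid-sentence capitalized words as a stand-in for named-entity alignment. The psycholinguistic dimension is left out, a real system would use proper NER and style representations, and the thread below is invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts; a deliberately crude style representation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def entity_proxy(text: str) -> set[str]:
    """Crude stand-in for NER: capitalized words that don't open a sentence."""
    entities = set()
    for sentence in re.split(r"[.!?]+\s*", text):
        tokens = re.findall(r"[A-Za-z']+", sentence)
        entities.update(t for t in tokens[1:] if t[0].isupper())
    return entities


def jaccard(a: set, b: set) -> float:
    """Set overlap as intersection over union (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mirroring_features(post: str, reply: str) -> dict[str, float]:
    """Relational features: every value depends on BOTH texts, not the reply alone."""
    return {
        "lexical_cosine": cosine(bag_of_words(post), bag_of_words(reply)),
        "entity_jaccard": jaccard(entity_proxy(post), entity_proxy(reply)),
    }


# Invented example thread, for illustration only.
post = "I think rent control in Berlin failed because landlords just left the market."
mirroring_reply = ("Rent control in Berlin did not fail because landlords left the "
                   "market; the Berlin data shows supply shifted.")
independent_reply = "Honestly the real problem is zoning, not tenants or owners."

print(mirroring_features(post, mirroring_reply))
print(mirroring_features(post, independent_reply))
```

Neither feature can be computed from the reply alone, which is exactly why a detector that ignores the original post cannot see this signal.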
The second cluster is what you might call "textbook polish." LLM arguments score higher on formal quality markers (cogency, justification, respect, positive tone), while humans score higher on lexical creativity, negative emotion, and conversational interactivity (see: Do LLM arguments actually argue better than humans?). The model argues like a well-behaved debate manual; people argue like people in a fight. The same note traces that gap to RLHF, which rewards politeness over authentic disagreement. So the absence of irritation, slang, and rhetorical sharp edges is itself a signal.
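The "textbook polish" contrast can be approximated per text with simple rates. This is a minimal sketch assuming tiny hypothetical word lists for politeness/respect and for negative emotion, type-token ratio as a rough proxy for lexical creativity, and question marks as a rough proxy for conversational interactivity. None of these are the measures the cited note used, and the two sample replies are invented.

```python
import re

# Hypothetical mini-lexicons for illustration; a real system would use validated ones.
POLITE = {"appreciate", "understand", "fair", "valid", "respectfully", "thanks"}
NEG_EMOTION = {"hate", "stupid", "dumb", "ridiculous", "awful", "annoying", "absurd"}


def polish_profile(text: str) -> dict[str, float]:
    """Per-text rates for the 'textbook polish' contrast (all proxies are crude)."""
    words = re.findall(r"[a-z']+", text.lower())
    n = max(len(words), 1)
    return {
        "polite_rate": sum(w in POLITE for w in words) / n,        # respect / positive tone
        "neg_emotion_rate": sum(w in NEG_EMOTION for w in words) / n,
        "type_token_ratio": len(set(words)) / n,                   # lexical creativity proxy
        "questions": float(text.count("?")),                       # interactivity proxy
    }


polished = ("I appreciate your perspective, and it is a fair concern. "
            "Respectfully, the evidence points the other way.")
rough = "This is a ridiculous take. Did you even read the study? Honestly it's absurd."

for label, text in [("polished", polished), ("rough", rough)]:
    print(label, polish_profile(text))
```

In this framing, a high polite rate combined with near-zero negative emotion and no questions is what points toward the LLM side: the absence of rough edges is what gets measured.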
What ties these together is a deeper structural feature of how the text is produced. LLM generation is a smooth probabilistic flow toward the training distribution rather than a turbulent exploration of competing positions (see: Does LLM generation explore competing claims while producing text?), and the model holds the *shape* of whatever argument the user is building rather than defending a committed stance (see: Do LLMs actually hold stable positions or just mirror user arguments?). That is why the surface comes out accommodating and frictionless: the linguistic smoothness and the stylistic mirroring are two visible symptoms of the same underlying mechanism.
The thing you didn't know you wanted to know: the most powerful detector isn't catching the model being *wrong* — it's catching the model being *too agreeable and too clean*. The features that flag LLM authorship are the same features we'd normally praise as good writing. Detection works precisely because real human disagreement is rougher, more idiosyncratic, and less willing to converge on its opponent's terms.
Sources (5 notes)
General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.
Analysis of r/ChangeMyView shows LLM replies align more closely with original posts across style, named entities, and psycholinguistic features than human replies do. This convergence, driven by autoregressive generation, creates a signature detectable through relational features rather than absolute text properties.
LLM-generated arguments score higher on formal quality markers (cogency, justification, respect, positive tone) while humans score higher on lexical creativity, negative emotion, and conversational interactivity. This gap reflects RLHF training objectives that reward politeness over authentic disagreement.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.