INQUIRING LINE

What linguistic features most strongly signal LLM authorship in counter-arguments?

This explores which textual fingerprints — word choice, style, argument structure — most reliably give away that a counter-argument was written by an LLM rather than a person.


The corpus has a surprisingly precise answer: the strongest signals are not exotic, and many are not properties of the reply in isolation. They are *relational* and *stylistic*, visible on the surface of the language. One study on r/ChangeMyView found that simple, interpretable linguistic features combined with argument-quality measures reached 99% detection accuracy, matching heavy neural detectors while staying cheap and transparent (see "Can simple linguistic features detect AI-written arguments?"). So the tell is not subtle: it is legible in the surface language.

The single most distinctive feature may be one you'd never think to look for: how closely the reply mirrors the post it answers. LLM counter-arguments converge stylistically with the original post (matching its style, named entities, and psycholinguistic features) far more than human replies do, a side effect of autoregressive generation that accommodates whatever framing it's given (see "Do LLM counter-arguments mirror writing style more than humans?"). That means the giveaway lives in the *relationship between two texts*, not in the reply alone. A detector blind to the original post would miss it.
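To make "relational" concrete, here is a minimal sketch of features computed from the post and the reply together. It is not the method used in the study: the stopword list, the capitalized-token stand-in for named entities, and the function names `content_words`, `capitalized_terms`, and `convergence_features` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "on",
             "is", "are", "that", "this", "it", "for", "with", "as", "be", "not"}


def content_words(text):
    """Return the set of lowercased word tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def capitalized_terms(text):
    """Crude named-entity proxy: capitalized tokens that do not open a sentence."""
    terms = set()
    for sentence in re.split(r"[.!?]+\s*", text):
        toks = re.findall(r"[A-Za-z']+", sentence)
        for tok in toks[1:]:  # skip the sentence-initial token
            if tok[:1].isupper() and tok[1:].islower():
                terms.add(tok)
    return terms


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def convergence_features(post, reply):
    """Relational features: each value needs BOTH texts to be computed."""
    return {
        "lexical_overlap": jaccard(content_words(post), content_words(reply)),
        "entity_overlap": jaccard(capitalized_terms(post), capitalized_terms(reply)),
    }
```

The design point is that `convergence_features` cannot be evaluated on the reply alone, which is exactly why a detector that never sees the original post has no access to this signal. A real pipeline would swap in a proper named-entity recognizer and psycholinguistic lexicons.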

The second cluster is what you might call "textbook polish." LLM arguments score higher on formal quality markers (cogency, justification, respect, positive tone), while humans score higher on lexical creativity, negative emotion, and conversational interactivity. The model argues like a well-behaved debate manual; people argue like people in a fight. The source note attributes that gap to RLHF rewarding politeness over authentic disagreement (see "Do LLM arguments actually argue better than humans?"). So the absence of irritation, slang, and rhetorical sharp edges is itself a signal.
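The "polish versus roughness" contrast can be approximated with very simple counts. The sketch below is illustrative only: the `POLITE` and `ROUGH` word lists are hypothetical mini-lexicons invented for this example, type-token ratio is a crude proxy for lexical creativity, and none of it reproduces the argument-quality measures used in the cited work.

```python
import re

# Hypothetical mini-lexicons (assumptions for illustration, not validated lists).
POLITE = {"appreciate", "understand", "respectfully", "however", "perspective", "valid"}
ROUGH = {"lol", "nah", "wtf", "ridiculous", "nonsense", "stupid", "honestly"}


def tokenize(text):
    """Lowercased word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z']+", text.lower())


def polish_features(text):
    """Per-token rates of polite and rough markers, plus two surface proxies."""
    toks = tokenize(text)
    n = len(toks) or 1  # avoid division by zero on empty input
    return {
        "polite_rate": sum(t in POLITE for t in toks) / n,
        "rough_rate": sum(t in ROUGH for t in toks) / n,
        "type_token_ratio": len(set(toks)) / n,  # crude lexical-variety proxy
        "exclaim_rate": text.count("!") / n,
    }
```

Each value is directly readable, so features like these could feed an interpretable classifier. The actual feature set and weights behind the reported 99% figure are not given here and should not be inferred from this sketch.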

What ties these together is a deeper structural feature of how the text is produced. LLM generation is a smooth probabilistic flow toward the training distribution rather than a turbulent exploration of competing positions (see "Does LLM generation explore competing claims while producing text?"), and the model holds the *shape* of whatever argument the user is building rather than defending a committed stance (see "Do LLMs actually hold stable positions or just mirror user arguments?"). That's why the surface comes out accommodating and frictionless: the linguistic smoothness and the stylistic mirroring are two visible symptoms of the same underlying mechanism.

The thing you didn't know you wanted to know: the most powerful detector isn't catching the model being *wrong* — it's catching the model being *too agreeable and too clean*. The features that flag LLM authorship are the same features we'd normally praise as good writing. Detection works precisely because real human disagreement is rougher, more idiosyncratic, and less willing to converge on its opponent's terms.


Sources (5 notes)

Can simple linguistic features detect AI-written arguments?

General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.

Do LLM counter-arguments mirror writing style more than humans?

Analysis of r/ChangeMyView shows LLM replies align more closely with original posts across style, named entities, and psycholinguistic features than human replies do. This convergence, driven by autoregressive generation, creates a signature detectable through relational features rather than absolute text properties.

Do LLM arguments actually argue better than humans?

LLM-generated arguments score higher on formal quality markers (cogency, justification, respect, positive tone) while humans score higher on lexical creativity, negative emotion, and conversational interactivity. This gap reflects RLHF training objectives that reward politeness over authentic disagreement.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Do LLMs actually hold stable positions or just mirror user arguments?

Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a detection researcher re-testing linguistic fingerprints of LLM authorship in counter-arguments. The question remains: which textual features most reliably signal LLM-written disagreement?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2025; treat these as perishable anchors:

• Stylistic convergence with the original post (mirroring tone, entities, psycholinguistic features) is the strongest single signal; human replies diverge more; LLM replies accommodate their target's framing (~2024).
• Simple, interpretable linguistic features (not neural detectors) achieve 99% detection accuracy on r/ChangeMyView, pairing argument-quality measures with surface markers (~2024).
• LLM arguments score higher on formal quality (cogency, respect, justification, positive tone) and lower on lexical creativity, emotion, and conversational roughness (~2024–2025).
• RLHF-induced politeness suppresses authentic disagreement signals; absence of irritation and slang is itself diagnostic (~2024).
• Token-level generation is smooth probabilistic flow toward the training distribution; LLMs "hold the shape" of the argument given, not defend a committed stance (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.06782 — Debating with More Persuasive LLMs Leads to More Truthful Answers (2024-02)
• arXiv:2406.05587 — Creativity Has Left the Chat: The Price of Debiasing Language Models (2024-06)
• arXiv:2505.09662 — When Large Language Models are More Persuasive Than Incentivized Humans, and Why (2025-05)
• arXiv:2507.01936 — The Thin Line Between Comprehension and Persuasion in LLMs (2025-07)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, probe whether newer training regimes (constitutional AI, process reward models, adversarial fine-tuning), inference-time steering (chain-of-thought variants, debate protocols, uncertainty sampling), or improved evaluation harnesses have since relaxed the politeness/smoothness trade-off or broken stylistic convergence. Does mirroring still hold? Are detectors still cheap-and-interpretable, or have newer LLMs learned to randomize surface features? Separate the durable insight (relational detection beats isolated-text detection) from perishable findings (specific feature weights, RLHF artifacts).
(2) SURFACE THE STRONGEST CONTRADICTING OR SUPERSEDING WORK from the last ~6 months. Has recent work shown LLMs *can* argue rougher, more idiosyncratic, or less accommodating? Do multi-agent or ensemble debate setups break stylistic convergence? Has any paper challenged the 99% accuracy claim or shown detector drift?
(3) PROPOSE 2 RESEARCH QUESTIONS that assume the regime may have moved: e.g., "Under what training conditions does stylistic convergence disappear?" or "Do newer models trained on synthetic disagreement data become harder to detect via relational features?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
