INQUIRING LINE

How does the task type change which linguistic features distinguish AI from humans?

This explores whether the linguistic 'tells' that separate AI from human writing stay constant, or whether they shift depending on what the AI is asked to do — argue, recount an experience, take a stance, make a joke.


This reads the question as asking whether AI's giveaway features are fixed, or whether they change with the task. The corpus is clear: the tells are task-dependent. There's no single linguistic fingerprint — each kind of writing exposes a different seam.

Start with argument-writing. When LLMs generate counter-arguments, the distinguishing features are accommodation to the prompt and 'textbook-quality' argument markers: the prose is too tidily structured and too eager to address the assignment, in ways humans rarely bother to be. General linguistic features combined with argument-quality measures catch this at 99% accuracy on r/ChangeMyView counter-arguments (Can simple linguistic features detect AI-written arguments?). But shift the task to recounting a personal experience, such as a hotel review or a memory, and a completely different signature appears. Here the tell isn't tidy structure but a built-in falsity: AI describing experiences it never had produces higher analytic complexity, more emotional and descriptive language, and lower readability than even intentional human lying, detectable at over 80% accuracy (How does AI-generated false experience differ linguistically from human deception?). The detectable feature in argumentation is excess competence; in narration it's an inability to be plausibly mundane (Does AI-generated text lose core properties of human writing?).

Change the task again, to evaluative or rhetorical writing, and yet another gap opens. LLMs have mastered grammar but avoid taking a stance: they lean on 'manner nouns' and anaphoric references that stay descriptively neutral, where human writers reach for status and evidential nouns that carry judgment. The result is organizationally coherent but argumentatively inert prose (Why does AI writing sound generic despite being grammatically correct?). Push toward humor or wordplay and the failure becomes almost mechanical: transformers aggregate every word's meaning in parallel through weighted averaging rather than selectively suppressing the irrelevant ones, so they miss the frame-activation that jokes depend on (Why do AI systems miss jokes and wordplay so consistently?). Same model, different task, different breaking point.
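The contrast between weighted parallel aggregation and selective suppression can be made concrete with a toy example. The sketch below is a minimal illustration, not a model of any real transformer: the scores and vectors are made up, and real attention adds learned projections, multiple heads, and many layers. It shows only the one property at issue, that softmax weights are all positive, so a low-scoring token is down-weighted but still contributes to the mix.

```python
import math

def softmax(scores):
    """Convert raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(scores, values):
    """Softmax-weighted average of value vectors; every token contributes."""
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical toy input: three context tokens with invented relevance scores
# and 2-d "meaning" vectors. The third token stands in for a reading that a
# punchline would need to discard entirely.
scores = [2.0, 1.0, -3.0]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

weights, mixed = aggregate(scores, values)
print("weights:", weights)
print("mixed representation:", mixed)
```

In this toy setup the third token receives the smallest weight, yet its vector still shifts the result. Selective suppression, by contrast, would set that weight to exactly zero.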

What's striking is that beneath these task-specific tells, a deeper pattern holds and even widens. Newer models diverge *further* from human lexical patterns on six measurable dimensions of vocabulary diversity (volume, abundance, variety, evenness, disparity, and dispersion), yet human judges, including trained linguists, can't reliably perceive the difference (Can human judges detect measurable differences in AI text?; Why do newer AI models diverge further from human writing patterns?). RLHF seems to optimize for quality ratings, not human-likeness, so the gap grows even as it becomes invisible (Can humans detect AI text if machines can measure it?). The task-specific tells (accommodation, false experience, missing stance) are the ones that simple, interpretable features can catch; the lexical divergence is the one that statistics can measure but nobody's eyes can see.
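To give a feel for what a "measurable dimension of vocabulary diversity" looks like, here is a minimal sketch that computes simple proxies for three of the six named dimensions. The formulas are assumptions chosen for illustration (token count for volume, type-token ratio for variety, normalized Shannon entropy for evenness), and the function name `lexical_profile` is hypothetical. The cited studies may operationalize these dimensions differently.

```python
import math
from collections import Counter

def lexical_profile(text):
    """Return three simple lexical-diversity proxies for a text.

    Assumption: lowercasing plus whitespace tokenization; the cited
    studies' actual measures may differ.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)

    # Variety proxy: type-token ratio (distinct words / total words).
    variety = n_types / n_tokens if n_tokens else 0.0

    # Evenness proxy: Shannon entropy of the word distribution, divided by
    # its maximum (log of the number of types), so 1.0 means perfectly even.
    if n_types > 1:
        entropy = -sum((c / n_tokens) * math.log(c / n_tokens)
                       for c in counts.values())
        evenness = entropy / math.log(n_types)
    else:
        evenness = 0.0

    return {"volume": n_tokens, "variety": variety, "evenness": evenness}

profile_a = lexical_profile("the cat sat on the mat and the dog sat too")
profile_b = lexical_profile("every word here appears exactly once")
print(profile_a)
print(profile_b)
```

Numbers like these are easy for a script to compare across thousands of texts, which is consistent with the finding that statistical tests pick up differences a human reader does not notice.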

The less obvious upshot: 'distinguishing AI from human' isn't one detection problem but many, and the choice of task decides whether the difference shows up as surface or structural. Some researchers argue the divergence is contingent, and that better training narrows it. Others locate it in the architecture and the act itself: LLMs produce strings via probability distributions while humans use language to address and relate to someone (Are language models and human speakers doing the same thing?), and AI output is 'event-residue' that readers animate into a pseudo-exchange (Does AI generate genuine utterances or just text patterns?). On that view the difference is categorical from the outside but subtle from inside shared discourse (Do humans and LLMs differ fundamentally or just superficially?), which is exactly why the answer keeps changing with the task you set.


Sources (11 notes)

Can simple linguistic features detect AI-written arguments?

General linguistic features combined with argument-quality measures achieved 99% accuracy detecting LLM-generated counter-arguments on r/ChangeMyView, matching heavyweight neural detectors while remaining computationally cheap and transparent. LLMs produce detectable stylistic signatures: accommodation to prompts and textbook-quality argument markers that humans don't replicate.

How does AI-generated false experience differ linguistically from human deception?

AI text about personal experiences is inherently false by structural necessity, not intent. Compared to intentional human deception, it shows higher analytic complexity, greater emotional content, more descriptive language, and lower readability—detectable with >80% accuracy.

Does AI-generated text lose core properties of human writing?

Research shows artificial text disrupts dialogic symmetry, context continuity, embodied authorship, and political situatedness. These are not surface flaws but structural absences—AI hotel reviews show 80%+ detection accuracy due to inherent falsity about personal experience distinct from human deception.

Why does AI writing sound generic despite being grammatically correct?

AI text uses manner nouns and anaphoric references that are descriptively neutral, while human writers use status and evidential nouns that carry evaluative weight. This produces organizationally coherent but argumentatively inert prose.

Why do AI systems miss jokes and wordplay so consistently?

Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.

Can human judges detect measurable differences in AI text?

Six-dimension MANOVA analysis confirms significant differences between ChatGPT and human writing across vocabulary volume, abundance, variety, evenness, disparity, and dispersion. Despite these robust statistical differences, human judges including linguists and NLP researchers fail to reliably distinguish AI from human text.

Why do newer AI models diverge further from human writing patterns?

ChatGPT-4.5 and o4-mini show greater lexical diversity differences from human text than earlier models, yet human judges cannot reliably distinguish them. Training objectives like RLHF appear to optimize for quality ratings rather than human-like writing patterns.

Can humans detect AI text if machines can measure it?

LLM-generated text differs significantly on six lexical diversity dimensions, confirmed through statistical analysis across multiple models. Yet human judges, including trained linguists, cannot reliably detect these differences—and newer models diverge further while becoming harder to spot.

Are language models and human speakers doing the same thing?

LLMs produce strings via probability distributions; humans use language to address and relate to others. They share surface form but differ in what produces output, what it does socially, and what receivers should do with it.

Does AI generate genuine utterances or just text patterns?

AI output carries communicative markers inherited from training data but lacks the event structure that produces actual utterances. Users supply the missing orientation through interpretive labor, creating a pseudo-event with structure only on the human side.

Do humans and LLMs differ fundamentally or just superficially?

Applying Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a detection researcher re-evaluating whether task-dependent linguistic fingerprints of AI hold up under newer models and evaluation methods. The core question: do distinguishing features between AI and human text remain task-dependent, or have recent capability jumps or training shifts collapsed these task-specific tells into a single regime?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026. A curated library reported:
• Argumentation tasks expose accommodation and 'textbook-quality' structure at 99% detection accuracy via lightweight features (~2024).
• Personal narrative tasks reveal AI's inability to be mundane: higher analytic complexity, more description, lower readability than human deception (~2024).
• Evaluative/rhetorical writing shows AI's avoidance of stance via manner nouns and anaphoric neutrality, producing argumentatively inert prose (~2024).
• Humor and wordplay tasks expose parallel aggregation of word meanings rather than selective suppression of irrelevant frames (~2024).
• Lexical diversity diverges measurably in newer models across six dimensions, yet remains imperceptible to human judges, including linguists; RLHF optimizes quality, not human-likeness (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2404.00750 (2024-03): Can Language Models Recognize Convincing Arguments?
• arXiv:2508.00086 (2025-07): Do LLMs produce texts with "human-like" lexical diversity?
• arXiv:2510.14665 (2025-10): Beyond Hallucinations: The Illusion of Understanding in Large Language Models.
• arXiv:2604.22503 (2026-04): Measuring and Mitigating Persona Distortions from AI Writing Assistance.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the task-specific tells (accommodation in argument, false experience in narrative, missing stance in evaluation, frame-insensitivity in humor), probe whether post-2025 models with instruction-tuning, constitutional AI, or adversarial SFT have narrowed or eliminated these gaps. Separately: has the lexical divergence been addressed by new training regimes? Isolate what remains genuinely unsolved from what newer evals or methods now catch.
(2) Surface work from the last ~6 months that CONTRADICTS the task-dependency hypothesis — i.e., evidence that a *single* linguistic feature now generalizes across task types, or that task selection no longer matters for detection.
(3) Propose two research questions: (a) Does fine-tuning for "human-like" outputs (rather than quality) collapse the task-specific tells into one? (b) Can multi-task detection (training on mixed tasks) outperform task-specific detectors, or does task-conditional architecture remain necessary?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
