Can LLMs recognize rhetorical devices they cannot actually produce themselves?
This explores a split the corpus keeps surfacing: LLMs can spot and label features of language they can't genuinely generate or evaluate — so the question is whether recognition and production are even the same skill in these models.
Can an LLM identify a rhetorical device (irony, metaphor, a turn of argument) without being able to deploy it with real intent? The corpus suggests recognition and production run on separate, often disconnected tracks. The cleanest evidence is the gap between cataloguing and meaning. Models can extract explicit literary mechanics, such as metaphoric mappings and stylistic signatures, at high accuracy, yet they collapse on the implicit relations (24% accuracy), ambiguity, and evaluative stance where those devices actually do their work (Can LLMs truly understand literary meaning or just mechanics?). The same shape shows up in style: GPT-2 identifies authorship from surface patterns with 95% accuracy but has no framework to say why a stylistic choice carries meaning. That is detection without interpretation, which the corpus bluntly calls "cataloguing, not criticism" (Can language models truly understand literary style?).
Why would a model recognize what it can't produce? Because explanation and execution appear to be functionally separate pathways. "Potemkin understanding" is exactly this failure mode: a model explains a concept correctly, fails to apply it, and can even recognize its own failure, a combination no coherent human understanding would produce (Can LLMs understand concepts they cannot apply?). So "can it recognize a device it can't produce" may not be a paradox at all; it is the default when knowing-about and doing-with aren't wired together (What do language models actually know?).
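The explain / apply / recognize triple can be made concrete with a small classifier. This is only a minimal sketch, assuming each probe has already been graded by a person or an external grader; `ConceptProbe`, its field names, and the returned labels are hypothetical and are not taken from the Potemkin work itself.

```python
from dataclasses import dataclass


@dataclass
class ConceptProbe:
    """Graded outcome of probing one concept (hypothetical structure)."""
    explained_correctly: bool      # the model defined the concept correctly
    applied_correctly: bool        # the model used the concept correctly
    recognized_own_failure: bool   # the model flagged its own misuse when asked


def classify(probe: ConceptProbe) -> str:
    """Label a probe according to the pattern described in the text."""
    if not probe.explained_correctly:
        return "no explanation"
    if probe.applied_correctly:
        return "coherent"
    # Correct explanation plus failed application is the Potemkin pattern;
    # recognizing the failure completes the triple.
    if probe.recognized_own_failure:
        return "potemkin, failure recognized"
    return "potemkin"
```

For example, `classify(ConceptProbe(True, False, True))` returns `"potemkin, failure recognized"`, the combination the text describes as incompatible with coherent human understanding.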
But there's a sharper twist for rhetoric specifically. Some devices depend on holding two things at once: irony, deliberate ambiguity, a counterposition genuinely entertained. The corpus says LLMs structurally can't do this. On the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of cases versus 90% for humans, which the corpus reads as an inability to hold multiple interpretations simultaneously (Can language models recognize when text is deliberately ambiguous?). And generation itself flows smoothly toward the training distribution rather than exploring competing claims, so the model never produces real rhetorical "turbulence", the friction of arguing against yourself (Does LLM generation explore competing claims while producing text?). Worse, the model doesn't hold a position to be rhetorical *about*; it conforms to the shape of whatever argument the user is building (Do LLMs actually hold stable positions or just mirror user arguments?). A device like irony needs a stable stance to invert; shape-holding has nothing to invert.
One less expected complication: recognition can be faked from the outside too, which muddies what "recognize" means. LLM judges fall for authority signals and ornate formatting in zero-shot attacks that need no model access or optimization; they respond to rhetorical *surface* without evaluating substance (Can LLM judges be fooled by fake credentials and formatting?). And when pushed to fuse distant concepts, models generate elaborate, persuasive-sounding frameworks without checking whether the connection is legitimate, producing rhetoric with no evaluation of whether it should exist (Do language models evaluate semantic legitimacy when fusing concepts?). So the honest answer is layered: models reliably recognize rhetorical *form* (and even over-respond to it), often can't generate it with genuine intent, and, most interesting, frequently can't tell the difference between recognition and real evaluation. Structured prompting that forces the model to check its warrants closes part of the gap (Can structured argument prompts make LLM reasoning more rigorous?), which hints that the deficit is partly architectural and partly that nothing made the model look twice.
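To make "forcing the model to check its warrants" concrete, here is a minimal sketch of a prompt builder based on the six elements of Toulmin's argument model (claim, grounds, warrant, backing, qualifier, rebuttal). It is an illustration under stated assumptions, not the CQoT method's actual prompt: the function names, the instruction wording, and the label-matching check are all hypothetical.

```python
from typing import List, Tuple

# The six elements of Toulmin's argument model. The instruction wording
# is an assumption made for this sketch.
TOULMIN_STEPS: List[Tuple[str, str]] = [
    ("claim", "State the conclusion you are arguing for."),
    ("grounds", "List the evidence the claim rests on."),
    ("warrant", "State the rule that licenses the step from grounds to claim."),
    ("backing", "Say what supports the warrant itself."),
    ("qualifier", "State how strong the claim is (certain, probable, possible)."),
    ("rebuttal", "Name conditions under which the claim would not hold."),
]


def build_toulmin_prompt(question: str) -> str:
    """Build a prompt that asks for each Toulmin element as a labelled step."""
    lines = [f"Question: {question}", "", "Answer in the following labelled steps:"]
    for i, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{i}. {name.upper()}: {instruction}")
    lines.append("")
    lines.append("If you cannot state the warrant or backing, say so instead of answering.")
    return "\n".join(lines)


def missing_steps(response: str) -> List[str]:
    """Return the step names whose label is absent from a model response."""
    upper = response.upper()
    return [name for name, _ in TOULMIN_STEPS if f"{name.upper()}:" not in upper]
```

A caller would send `build_toulmin_prompt(...)` to whatever model it uses and pass the reply to `missing_steps`; a non-empty result signals that the response skipped an element, for example an unstated warrant. The label check is purely surface-level, so it confirms that a step was written, not that it is sound.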
Sources (10 notes)
LLMs successfully extract explicit literary features like metaphoric mappings and stylistic signatures. However, they systematically fail at implicit relations (24% accuracy), ambiguity recognition (32% vs 90% human), evaluative stance-taking, and preserving connotation—the core dimensions where literary meaning operates.
GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.
Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.
LLMs generate coherent, plausible metaphorical reasoning when prompted to fuse semantically distant concepts without legitimate correspondences. Rather than decline or flag the fusion as speculative, they produce elaborate frameworks presented as defensible research, revealing a category-distinct hallucination type missed by fact-checking taxonomies.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.