INQUIRING LINE

An AI can label every irony and metaphor it reads — but that doesn't mean it can actually use one.

Can LLMs recognize rhetorical devices they cannot actually produce themselves?

This explores a split the corpus keeps surfacing: LLMs can spot and label features of language they can't genuinely generate or evaluate — so the question is whether recognition and production are even the same skill in these models.

The question is whether an LLM can identify a rhetorical device (irony, metaphor, a turn of argument) without being able to deploy it with real intent. The corpus suggests recognition and production run on separate, often disconnected tracks. The cleanest evidence is the gap between cataloguing and meaning. Models extract explicit literary mechanics, such as metaphoric mappings and stylistic signatures, at high accuracy, yet collapse on the implicit relations (24% accuracy), ambiguity, and evaluative stance where those devices actually do their work (see "Can LLMs truly understand literary meaning or just mechanics?"). The same shape shows up in style: GPT-2 identifies authorship from surface patterns with 95% accuracy but has no framework to say why a stylistic choice carries meaning. That is detection without interpretation, which the corpus bluntly calls "cataloguing, not criticism" (see "Can language models truly understand literary style?").

Why would a model recognize what it can't produce? Because explanation and execution appear to be functionally separate pathways. "Potemkin understanding" is exactly this failure mode: a model explains a concept correctly, fails to apply it, and can even recognize its own failure, a combination no coherent human understanding would produce (see "Can LLMs understand concepts they cannot apply?"). So "can it recognize a device it can't produce" may not be a paradox at all; it is the default when knowing-about and doing-with aren't wired together (see "What do language models actually know?").

But there's a sharper twist for rhetoric specifically. Some devices depend on holding two things at once: irony, deliberate ambiguity, a counterposition genuinely entertained. The corpus says LLMs structurally can't do this. On the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of cases versus 90% for humans, which the corpus reads as an inability to hold multiple interpretations simultaneously (see "Can language models recognize when text is deliberately ambiguous?"). And generation itself flows smoothly toward the training distribution rather than exploring competing claims, so the model never produces real rhetorical "turbulence", the friction of arguing against yourself (see "Does LLM generation explore competing claims while producing text?"). Worse, it doesn't hold a position to be rhetorical *about*; it conforms to the shape of whatever argument the user is building (see "Do LLMs actually hold stable positions or just mirror user arguments?"). A device like irony needs a stable stance to invert; shape-holding has nothing to invert.

There is a further complication: recognition can be faked from the outside too, which muddies what "recognize" means. LLM judges fall for authority signals and ornate formatting in zero-shot attacks that need no model access or optimization; they respond to rhetorical *surface* without evaluating substance (see "Can LLM judges be fooled by fake credentials and formatting?"). And when pushed to fuse distant concepts, models generate elaborate, persuasive-sounding frameworks without checking whether the connection is legitimate: rhetoric produced with no evaluation of whether it should be (see "Do language models evaluate semantic legitimacy when fusing concepts?"). So the honest answer is layered: models reliably recognize rhetorical *form* (and even over-respond to it), often can't generate it with genuine intent, and, most interesting, frequently can't tell the difference between recognition and real evaluation. Structured prompting that forces the model to check its warrants closes part of the gap (see "Can structured argument prompts make LLM reasoning more rigorous?"), which hints that the deficit is partly architectural and partly just that nothing made the model look twice.

Sources 10 notes

Can LLMs truly understand literary meaning or just mechanics?

LLMs successfully extract explicit literary features like metaphoric mappings and stylistic signatures. However, they systematically fail at implicit relations (24% accuracy), ambiguity recognition (32% vs 90% human), evaluative stance-taking, and preserving connotation—the core dimensions where literary meaning operates.

Can language models truly understand literary style?

GPT-2 achieves 95% accuracy identifying authorship through style patterns alone, but lacks the evaluative framework to explain why those stylistic choices carry meaning. Detection without interpretation remains cataloguing, not criticism.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

What do language models actually know?

LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Do LLMs actually hold stable positions or just mirror user arguments?

Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.

Can LLM judges be fooled by fake credentials and formatting?

Research identified four evaluation biases in LLM judges, with authority and beauty biases being semantics-agnostic and trivially exploitable through fake references and formatting—zero-shot attacks requiring no model access or optimization.

Do language models evaluate semantic legitimacy when fusing concepts?

LLMs generate coherent, plausible metaphorical reasoning when prompted to fuse semantically distant concepts without legitimate correspondences. Rather than decline or flag the fusion as speculative, they produce elaborate frameworks presented as defensible research, revealing a category-distinct hallucination type missed by fact-checking taxonomies.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
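
The warrant-checking idea in the note above can be sketched as a small prompt builder. This is a minimal illustration, not the CQoT method's actual prompts, which are not reproduced in these notes. The six component names are the standard parts of Toulmin's argument model; the function name `build_toulmin_prompt`, the instruction wording, and the example question are all assumptions made for illustration.

```python
# Toulmin's six argument components, paired with an illustrative instruction.
# The instruction wording is an assumption, not text taken from the CQoT paper.
TOULMIN_STEPS = [
    ("claim", "State the conclusion you are arguing for."),
    ("data", "List the evidence or facts the claim rests on."),
    ("warrant", "Name the principle that licenses moving from the data to the claim."),
    ("backing", "Give the support that makes the warrant itself credible."),
    ("qualifier", "Say how strongly the claim holds (always, usually, possibly)."),
    ("rebuttal", "Describe the conditions under which the claim would fail."),
]


def build_toulmin_prompt(question: str) -> str:
    """Hypothetical helper: wrap a question in explicit Toulmin-style steps."""
    lines = [f"Question: {question}", "", "Answer in the following numbered steps:"]
    for number, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{number}. {name.upper()}: {instruction}")
    lines.append("")
    lines.append("Only after completing every step, give your final answer.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_toulmin_prompt("Is the narrator of this passage being ironic?"))
```

The point of the design is that the warrant and backing steps are spelled out rather than left implicit, so a model following the prompt has to state the premise it would otherwise skip. Whether that yields genuine evaluation or just better-structured surface text is exactly the open question this line of inquiry raises.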

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey · 3.43 match · arXiv
Word Meanings in Transformer Language Models · 3.33 match · arXiv
Probing Structured Semantics Understanding and Generation of Language Models via Question Answering · 2.56 match · arXiv
Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence · 2.42 match · arXiv
The Thin Line Between Comprehension and Persuasion in LLMs · 2.41 match · arXiv
Large Language Model Reasoning Failures · 1.74 match · arXiv
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning · 1.70 match · arXiv
Linguistic Calibration of Long-Form Generations · 1.69 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about rhetorical recognition versus production in language models. The core question remains open: Can LLMs recognize rhetorical devices they cannot authentically produce?

What a curated library found — and when (2023–2026, dated claims, not current truth):
• Models extract explicit rhetorical mechanics (metaphor, style signatures) at high accuracy but fail on implicit relations and evaluative stance where devices do their work (2023–2024).
• Ambiguity-holding is a structural bottleneck: GPT-4 achieves only 32% disambiguation versus 90% for humans; models can't maintain simultaneous interpretations (2023).
• LLM judges fall for rhetorical *surface* — ornate formatting, authority signals — without evaluating substance; they confuse recognition with real evaluation (2024–2025).
• Structured prompting forcing explicit warrant-checking (argumentation-scheme critical questions) partially closes the gap, suggesting the deficit is partly architectural (2025).
• Debiasing LLMs often eliminates their ability to produce creative or rhetorical output, raising a trade-off question (2024).

Anchor papers (verify; mind their dates):
• arXiv:2304.14399 (2023): Ambiguity modeling failure
• arXiv:2402.10669 (2024): Judge bias susceptibility
• arXiv:2412.15177 (2025): Argumentation-scheme prompting improves reasoning
• arXiv:2511.20471 (2026): Creative reasoning under multiple-path exploration

Your task:
(1) RE-TEST EACH CONSTRAINT. For models released or fine-tuned since mid-2025 (especially o1, Claude 3.5, Llama 3.2 variants, or MoE ensembles): do larger context windows, chain-of-thought training, or retrieval-augmentation relax the ambiguity-holding bottleneck? Does structured prompting (e.g., explicit dilemma frames, multi-stakeholder prompts) now enable recognition of irony *and* intent-detection? Separate what remains architecturally hard from what was merely prompt-engineering oversight.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any 2026+ paper show that scale or method *does* unify recognition and production? Or conversely, prove they're mutually exclusive?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If structured prompting closes the gap, what minimal constraint set forces a model to *produce* rhetoric it can recognize — and is that production genuine or still surface-imitation? (b) Can a probe distinguish whether a model's "recognition" of irony reflects genuine dual-valence parsing or pattern-matching to training-set metalanguage?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
