INQUIRING LINE

Can LLMs identify implicit metaphoric mappings that require pragmatic inference?

This explores whether LLMs can recover the unstated 'this maps to that' relationship behind a metaphor — the kind of figurative meaning you can only reach by inferring what the speaker intends, not by decoding the literal words.


The corpus reframes the question in a useful way: metaphor is less a special category to be detected than one instance of a broader pragmatic skill. The Diplomat dataset bundles metaphors, idioms, and puns into a single task, recovering literal meaning from non-literal expression, and the framing suggests that what LLMs lack is general semantic decoupling: the ability to separate what a phrase says from what it is doing (see 'Can one model handle all types of figurative language?'). So the real question is not 'can the model spot a metaphor' but 'can it infer intended meaning against literal meaning', which is exactly where pragmatic inference lives.

The discouraging part is what sits underneath that decoupling ability. When semantic content is pulled apart from the reasoning task, LLM performance collapses even when the correct rules are handed to the model in context: models lean on parametric commonsense and token associations rather than manipulating relationships symbolically (see 'Do large language models reason symbolically or semantically?'). Metaphoric mapping is precisely a structural operation (carry the relations from one domain onto another), so a model that reasons through familiar semantic association rather than structure is likely to handle conventional, high-frequency metaphors well and stumble on novel ones. A related finding sharpens this: models systematically prefer textually frequent phrasings over semantically equivalent rare ones, tracking statistical mass from pretraining rather than meaning (see 'Do language models really understand meaning or just surface frequency?'). A fresh, implicit mapping has no statistical mass to ride on.
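One way to make the conventional-versus-novel prediction testable is to score the two groups separately and look at the gap. The sketch below is only an illustration under stated assumptions: the `MetaphorItem` type, the `kind` labels, the exact-match scoring, and every example sentence are hypothetical, and the `predicted` strings are hard-coded rather than real model output. A real evaluation would need human-annotated mappings and a more forgiving comparison than string equality.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaphorItem:
    text: str        # the figurative utterance
    kind: str        # "conventional" or "novel" (hypothetical annotation)
    gold: str        # annotated intended meaning
    predicted: str   # model's answer; hard-coded here, not real model output


def accuracy_by_kind(items):
    """Return exact-match accuracy for each metaphor kind."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.kind] += 1
        if item.predicted.strip().lower() == item.gold.strip().lower():
            correct[item.kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


def novelty_gap(items):
    """Conventional accuracy minus novel accuracy; positive means novel is harder."""
    acc = accuracy_by_kind(items)
    return acc["conventional"] - acc["novel"]


# Invented toy items, for illustration only. They are not results.
ITEMS = [
    MetaphorItem("Time is a thief.", "conventional",
                 "time takes things away", "time takes things away"),
    MetaphorItem("He shot down my argument.", "conventional",
                 "he refuted my argument", "he refuted my argument"),
    MetaphorItem("Her inbox is a tide pool.", "novel",
                 "small things accumulate unnoticed", "small things accumulate unnoticed"),
    MetaphorItem("The meeting was a slow eclipse.", "novel",
                 "attention gradually faded", "the room got dark"),
]

if __name__ == "__main__":
    print(accuracy_by_kind(ITEMS))
    print(novelty_gap(ITEMS))
```

A gap near zero on a well-built item set would count against the frequency account; a large positive gap would be consistent with it, though not proof of the mechanism.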

Pragmatic inference also demands holding more than one reading at once, the literal and the intended, and choosing between them by what the speaker must have meant. Here the corpus is blunt: GPT-4 correctly disambiguates only 32% of deliberately ambiguous cases versus 90% for humans, a failure spanning lexical, structural, and scope ambiguity, suggesting models can't hold multiple interpretations simultaneously (see 'Can language models recognize when text is deliberately ambiguous?'). Metaphor is a managed ambiguity: you're meant to notice the literal reading is wrong and reach past it, so this gap cuts directly at the inferential move the question asks about. Relatedly, entailment work shows models often decide based on whether a conclusion looks familiar from training rather than whether the premise actually supports it (see 'Do LLMs predict entailment based on what they memorized?').
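The "hold both readings, then choose" requirement can be written down as a strict scoring rule. The following is a minimal sketch, not the protocol of any benchmark cited here: the dictionary fields (`readings`, `accepted`, `intended`, `chosen`) and all example items are assumptions made for illustration. An item counts as a hit only if the model acknowledges every annotated reading and also selects the intended one.

```python
def disambiguation_rate(items):
    """Fraction of items where the model accepts all readings AND picks the intended one."""
    hits = 0
    for item in items:
        sees_all = set(item["readings"]) <= set(item["accepted"])
        picks_intended = item["chosen"] == item["intended"]
        if sees_all and picks_intended:
            hits += 1
    return hits / len(items) if items else 0.0


# Invented toy items; "accepted" and "chosen" stand in for model behaviour.
AMBIGUOUS_ITEMS = [
    {"text": "That lawyer is a shark.",
     "readings": ["literal", "figurative"], "accepted": ["literal", "figurative"],
     "intended": "figurative", "chosen": "figurative"},   # hit
    {"text": "She is drowning in paperwork.",
     "readings": ["literal", "figurative"], "accepted": ["figurative"],
     "intended": "figurative", "chosen": "figurative"},   # miss: literal reading not recognised
    {"text": "He has a heart of stone.",
     "readings": ["literal", "figurative"], "accepted": ["literal", "figurative"],
     "intended": "figurative", "chosen": "literal"},      # miss: wrong reading chosen
    {"text": "The sculptor has a heart of stone on the bench.",
     "readings": ["literal", "figurative"], "accepted": ["literal", "figurative"],
     "intended": "literal", "chosen": "literal"},         # hit
]

if __name__ == "__main__":
    print(disambiguation_rate(AMBIGUOUS_ITEMS))
```

Requiring both conditions is a design choice: a model that always guesses "figurative" would score well on choice alone while never showing that it noticed the competing reading.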

The most interesting wrinkle is that explaining a metaphor and using one may run on different pathways entirely. 'Potemkin understanding' describes models that explain a concept correctly, fail to apply it, and even recognize the failure, a pattern that implies explanation and execution are functionally disconnected (see 'Can LLMs understand concepts they cannot apply?'). So a model might fluently define what 'time is a thief' maps onto while still missing the mapping when it has to act on a fresh one in conversation. That distinction, articulate-about versus operate-with, is the thing worth carrying away here.
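The articulate-about versus operate-with distinction can be measured as a conditional rate: among the cases where the model explains the mapping correctly, how often does it then fail to use it? The sketch below assumes a hypothetical `ProbeResult` record with two hand-labelled booleans, and the toy data is invented. It is a simplified conditional failure rate, not the metric defined in the Potemkin-understanding work.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    concept: str                 # the metaphor or mapping being probed
    explained_correctly: bool    # model gave a correct explanation (hand-labelled)
    applied_correctly: bool      # model used the mapping correctly on a new case


def explain_apply_gap(results: Sequence[ProbeResult]) -> Optional[float]:
    """Share of correctly explained concepts that the model then fails to apply.

    Returns None when nothing was explained correctly, since the rate is undefined.
    """
    explained = [r for r in results if r.explained_correctly]
    if not explained:
        return None
    failures = sum(1 for r in explained if not r.applied_correctly)
    return failures / len(explained)


# Invented toy results, for illustration only.
RESULTS = [
    ProbeResult("time is a thief", True, True),
    ProbeResult("argument is war", True, True),
    ProbeResult("grief is a tenant", True, False),
    ProbeResult("memory is a river delta", True, False),
    ProbeResult("silence is a ledger", False, False),  # excluded: never explained
]

if __name__ == "__main__":
    print(explain_apply_gap(RESULTS))
```

Conditioning on a correct explanation is what separates this pattern from an ordinary knowledge gap, where the model would fail at both steps.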

A quieter counterweight: LLMs do learn meaning as a fully relational system, compressing cultural and discourse structure from text without any external referent (see 'Can language models learn meaning without engaging the world?'). That's genuinely how a lot of conventional metaphor works: 'argument is war' and 'up is good' are linguistic conventions, not perceptions. So for the metaphors already sedimented into language, relational compression may be enough. The pragmatic-inference frontier is the novel mapping, built on the fly, that the corpus suggests current models are least equipped to reach.


Sources (7 notes)

Can one model handle all types of figurative language?

The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Can language models recognize when text is deliberately ambiguous?

The AmbiEnt benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, suggesting that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating that they respond to memorized propositions rather than premise-hypothesis relationships.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a pragmatic-reasoning researcher. The question: Can LLMs identify implicit metaphoric mappings that require pragmatic inference — i.e., recover unstated domain-to-domain structural transfers that depend on inferring intent against literal meaning?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library identified these constraints:

• Metaphoric mapping is a structural operation; LLMs reason through familiar semantic association (token-based) rather than symbolic manipulation, so they are likely to handle conventional, high-frequency metaphors well and stumble on novel ones (~2023).
• Models disambiguate only 32% of deliberately ambiguous cases vs. 90% for humans, failing to hold multiple readings simultaneously — a capacity metaphor (managed ambiguity) directly requires (~2023).
• LLMs systematically prefer textually frequent phrasings over semantically equivalent rare ones, tracking statistical mass from pretraining; novel implicit mappings have no statistical mass to ride on (~2024–2026).
• 'Potemkin understanding' appears: models explain a concept correctly, fail to apply it, and even recognize the failure, suggesting explanation and execution run on functionally disconnected pathways (~2025).
• When semantic content is decoupled from reasoning tasks, performance collapses even with correct rules provided in context (~2023).

Anchor papers (verify; mind their dates):
- 2023-06: arXiv:2306.09030 (Diplomat — pragmatic reasoning as unified task)
- 2023-05: arXiv:2305.14825 (in-context semantic vs. symbolic reasoning)
- 2023-04: arXiv:2304.14399 (ambiguity modeling)
- 2026-02: arXiv:2602.06176 (reasoning failures, recent synthesis)

Your task:
(1) RE-TEST EACH CONSTRAINT. For novel metaphor generation or comprehension in frontier models (GPT-4o, o1, Claude 3.5+): does prompt engineering, chain-of-thought forcing, or latent-space reasoning (arXiv:2412.06769, arXiv:2511.20471) now enable structural transfer? Does long-context (arXiv:2406.13121) help hold multiple readings? Separate durable failure (structural vs. token-based reasoning) from perishable limitation (training, prompting, or architecture fix).
(2) Surface strongest contradicting or superseding work from last ~6 months on metaphor, pragmatic inference, or creative/abductive reasoning in LLMs.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can mechanistic interpretability (arXiv:2507.08017) reveal whether newer models have learned to decouple literal from intended meaning? (b) Does reasoning in continuous latent space (arXiv:2412.06769) bypass the token-frequency bottleneck for novel mappings?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
