Can LLMs improve at metaphor if they handle decoupled semantics better?
This note explores whether LLMs' weakness at metaphor is really a symptom of a deeper problem: their difficulty separating a word's literal meaning from its non-literal use, which researchers call decoupling semantics. The corpus suggests the answer is a qualified yes, but with a twist: metaphor failure isn't a standalone skill gap, it's the visible tip of how these models reason at all. One line of work reframes metaphor, idioms, and puns not as separate categories to be memorized but as a single pragmatic task, namely recovering literal meaning from non-literal expression. That framing implies that what LLMs lack is general semantic-decoupling ability rather than more metaphor-specific training data (see "Can one model handle all types of figurative language?"). If that's true, improving decoupling would lift metaphor along with everything else in the family.
But here's the catch the corpus keeps circling: LLMs may not have a clean "meaning" representation to decouple in the first place. When researchers strip the familiar semantic content out of reasoning tasks and leave the logical rules intact, model performance collapses, even with the correct rules supplied in context. That is evidence that LLMs reason through semantic association, not symbolic manipulation (see "Do large language models reason symbolically or semantically?"). They also systematically prefer high-frequency phrasings over rarer but equivalent ones, suggesting they track statistical mass from pretraining more than meaning itself (see "Do language models really understand meaning or just surface frequency?"). Metaphor is precisely where this bites: novel literary metaphors are low-frequency and demand mapping one conceptual domain onto another, and that's exactly where comprehension degrades, while conventional, lexicalized metaphors (already baked into the training distribution) work fine (see "Where does LLM metaphor comprehension actually break down?").
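To make the "strip the semantic content, keep the rules" manipulation concrete, here is a minimal sketch in Python. It is not the procedure used in the cited work; the function name `abstract_symbols`, the placeholder tokens `X1`, `X2`, ... and the toy syllogism are all assumptions made for illustration. The idea is only that each content word is replaced by the same arbitrary token everywhere it occurs, so the logical frame survives while the familiar associations disappear.

```python
import re

def abstract_symbols(text, content_words):
    """Replace each listed content word with a stable placeholder token.

    Hypothetical helper: the logical frame ("If ... then ...") is left
    untouched, and only the words in content_words are swapped out.
    """
    # Assign X1, X2, ... in the order the words were listed.
    mapping = {w: f"X{i}" for i, w in enumerate(content_words, start=1)}
    # Match whole words only, so "bird" does not match inside "birds".
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

# Toy example (made up for this sketch, not taken from any benchmark).
task = "If something is a bird then it can fly. A sparrow is a bird. Can a sparrow fly?"
stripped, mapping = abstract_symbols(task, ["bird", "sparrow", "fly"])
print(stripped)
# If something is a X1 then it can X3. A X2 is a X1. Can a X2 X3?
```

Both versions have the same logical structure and the same correct answer. A model that reasons symbolically should answer them equally well; the corpus claim is that performance drops on the stripped version.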
There's an even sharper diagnosis. Models can explain a concept correctly and then fail to apply it, a "potemkin" pattern where the explanation pathway and the execution pathway are functionally disconnected (see "Can LLMs understand concepts they cannot apply?"). So a model might define metaphor flawlessly and still mishandle a fresh one, because knowing-about and doing are wired separately. Relatedly, metaphor often requires holding the literal and figurative reading at once, and LLMs are strikingly bad at sustaining multiple interpretations: on the AMBIENT benchmark, GPT-4 correctly disambiguates only 32% of deliberately ambiguous cases versus 90% for humans (see "Can language models recognize when text is deliberately ambiguous?"). Decoupling semantics isn't just separating literal from figurative; it's keeping both live simultaneously, which the architecture resists.
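One way to put a number on the explain-versus-apply gap described above is to count, among concepts a model explains correctly, how often it then fails to apply them. The sketch below is an assumption about how such a rate could be computed, not the metric from the cited work; the `Trial` record, the `explain_apply_gap` function, and the four toy trials are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """One concept probed twice: once by explanation, once by application."""
    concept: str
    explained_correctly: bool
    applied_correctly: bool

def explain_apply_gap(trials):
    """Fraction of correctly explained concepts that were applied incorrectly.

    Returns None when no concept was explained correctly, since the rate
    is undefined in that case.
    """
    explained = [t for t in trials if t.explained_correctly]
    if not explained:
        return None
    failures = sum(1 for t in explained if not t.applied_correctly)
    return failures / len(explained)

# Placeholder records for illustration only; these are not real results.
trials = [
    Trial("metaphor", explained_correctly=True, applied_correctly=False),
    Trial("idiom", explained_correctly=True, applied_correctly=True),
    Trial("pun", explained_correctly=False, applied_correctly=False),
    Trial("simile", explained_correctly=True, applied_correctly=False),
]
gap = explain_apply_gap(trials)
print(f"{gap:.2f}")  # 2 of the 3 correctly explained concepts fail on application
```

Conditioning on a correct explanation is the design choice that matters here: it separates the "potemkin" pattern from plain ignorance, where the model can neither explain nor apply the concept.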
Where the corpus gets generative is on what "better decoupling" might concretely look like. Pure relational compression of text is enough to learn fluent, culturally situated language without any external grounding (see "Can language models learn meaning without engaging the world?"), but fluency clearly isn't the same as the conceptual mapping novel metaphor needs. A more promising hint comes from work showing that partial symbolic augmentation beats both raw language and full formalization: selectively adding structure while preserving semantic richness yields reported accuracy gains of 4-8%, because full formalization throws away the very nuance metaphor depends on (see "Why does partial formalization outperform full symbolic logic?"). And metaphor may not even live in the "reasoning" bucket current methods optimize: it leans on transformational and exploratory creative reasoning that existing LLM reasoning techniques simply don't target (see "Can LLMs reason creatively beyond conventional problem-solving?").
The thing you might not have expected: chasing metaphor directly is probably the wrong move. The corpus points to metaphor as a stress test for a general capacity — separating meaning from statistical surface form while holding competing readings open — and the most credible levers (selective symbolic scaffolding, creative-reasoning paradigms, closing the explain-versus-apply gap) all aim at that underlying capacity rather than at metaphor itself. Improve decoupled semantics and metaphor improves as a side effect; train on metaphor alone and you mostly teach the model more conventional metaphors to pattern-match.
Sources (9 notes)
The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that tracking statistical mass from pretraining, rather than recognizing meaning, is the models' primary mechanism.
LLMs handle conventional, lexicalized metaphors but fail on novel literary metaphors requiring conceptual domain mapping. This degradation reveals a fundamental gap between pattern recognition and genuine semantic mapping.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.