INQUIRING LINE

Can language models adapt irony detection to specific communicative contexts?

This explores whether LLMs can detect irony in a context-sensitive way — adjusting to who's speaking, the stakes, and the situation — rather than flagging irony from surface patterns alone.


The short answer the corpus suggests is: not well, and the reason is more interesting than "irony is hard." Models clearly recognize irony as a *pattern*, but they miscalibrate *when* it applies. GPT-4o assigns significantly higher irony scores than humans do, systematically overestimating how often text is meant ironically, likely because ironic examples are more salient in training data than in everyday use (see "Do language models overestimate how often irony appears?"). So the failure isn't blindness to irony; it's an inability to gauge its real-world prevalence in a given context.
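To make "miscalibration of prevalence" concrete, here is a minimal sketch of how the gap could be measured on paired ratings. The 0-100 rating scale, the threshold of 50, the function names, and the six example ratings are all assumptions made for illustration; none of them come from the cited study.

```python
from statistics import mean


def calibration_gap(model_scores, human_scores):
    """Mean signed difference (model minus human) on the same items."""
    if len(model_scores) != len(human_scores):
        raise ValueError("scores must be paired item by item")
    return mean(m - h for m, h in zip(model_scores, human_scores))


def prevalence(scores, threshold=50):
    """Share of items rated at or above the (assumed) irony threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical irony ratings (0-100) for six items; invented for illustration.
human = [10, 20, 80, 10, 30, 90]
model = [40, 60, 90, 30, 70, 90]

gap = calibration_gap(model, human)
print(f"mean overestimate: {gap:.1f} points")
print(f"items judged ironic: human {prevalence(human):.0%}, model {prevalence(model):.0%}")
```

A positive gap together with a higher model prevalence is the signature described above: the model agrees with humans on the clearly ironic items but also sees irony where humans mostly do not.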

That points to a deeper, shared deficit across pragmatic tasks: tracking communicative *stakes*. The most direct neighbor here is scalar implicature, where ChatGPT shows no context-sensitivity at all. It doesn't shift its inferences for explicit literal-mode instructions, for information focus, or for face-threatening situations, even though humans flexibly modulate all three (see "Can language models adapt implicature to conversational context?"). Irony and implicature are cousins: both require reading past the literal words to recover intended meaning, and both demand sensitivity to who's talking and why. If a model can't adapt implicature to context, expecting it to adapt irony detection to context is asking for the same competence under a different name.
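One way to operationalize "context-sensitivity" is to rate the same sentences under a neutral framing and a manipulated one (for example, a face-threatening setup) and measure how far the ratings move. The sketch below assumes a 0-100 rating scale and uses invented numbers; `context_shift` is a hypothetical helper, not a metric taken from the cited work.

```python
from statistics import mean


def context_shift(neutral, manipulated):
    """Mean absolute change in rating when only the context changes."""
    if len(neutral) != len(manipulated):
        raise ValueError("ratings must be paired item by item")
    return mean(abs(m - n) for n, m in zip(neutral, manipulated))


# Hypothetical ratings for the same three sentences in two contexts.
human_neutral, human_face_threat = [80, 70, 90], [40, 30, 60]
model_neutral, model_face_threat = [85, 75, 90], [85, 75, 90]

human_shift = context_shift(human_neutral, human_face_threat)
model_shift = context_shift(model_neutral, model_face_threat)
print(f"human shift: {human_shift:.1f}, model shift: {model_shift:.1f}")
```

A shift near zero for the model alongside a clearly non-zero shift for humans is what "no context-sensitivity" would look like in this framing.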

Why does context bounce off? One mechanism is that strong training-time associations override what's actually in front of the model: LMs generate outputs inconsistent with their context because parametric priors dominate in-context information, and prompting alone can't reliably fix it (see "Why do language models ignore information in their context?"). That's exactly the irony-overestimation story at the representational level: the prior ("irony looks like this") wins over the local signal ("but here it's sincere"). A related limitation is that models struggle to hold multiple readings of the same text at once. GPT-4 disambiguates deliberately ambiguous sentences only 32% of the time versus 90% for humans (see "Can language models recognize when text is deliberately ambiguous?"). Irony is fundamentally a two-reading phenomenon (literal vs. intended); if a model can't keep both interpretations live, context-adaptive irony detection has nowhere to stand.
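The "prior wins over the local signal" intuition can be illustrated with a toy Bayesian calculation. This is only an analogy under stated assumptions, not the mechanism reported in the cited work: the function name, the prior values, and the likelihood ratio are all invented for illustration.

```python
def posterior_irony(prior, sincere_to_ironic_ratio):
    """Posterior probability of irony after seeing contextual evidence.

    sincere_to_ironic_ratio is P(context | sincere) / P(context | ironic),
    so values above 1 mean the context favours the sincere reading.
    """
    ironic = prior
    sincere = (1 - prior) * sincere_to_ironic_ratio
    return ironic / (ironic + sincere)


# Assumed: the context is four times more likely under a sincere reading.
evidence = 4.0

realistic = posterior_irony(prior=0.1, sincere_to_ironic_ratio=evidence)
inflated = posterior_irony(prior=0.9, sincere_to_ironic_ratio=evidence)
print(f"realistic prior -> P(irony) = {realistic:.2f}")
print(f"inflated prior  -> P(irony) = {inflated:.2f}")
```

With the same contextual evidence, the inflated prior still lands above 0.5 while the realistic prior lands well below it, which is the shape of the failure described above: the evidence is present, but it is not strong enough to overturn what the model already expects.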

Here's the lateral move worth knowing: a promising reframe is to stop treating irony as its own classification problem. The Diplomat dataset folds metaphor, idiom, and pun into a single pragmatic task (recover the literal meaning behind a non-literal expression) and argues that what models need is better *semantic decoupling*, not more category-specific training data (see "Can one model handle all types of figurative language?"). Under that lens, "adapt irony to context" stops being a niche skill and becomes one instance of a general pragmatic-reasoning ability the corpus repeatedly finds underdeveloped.

Finally, part of why this competence stays weak may be a training-signal problem, not a ceiling. Models are rewarded for predicting information, not for the relational, socially attuned work that pragmatics actually is. That is the same reason they don't develop conversation-maintenance moves like reference repair (see "Why don't language models develop conversation maintenance skills?"), and the same reason they default to face-saving avoidance instead of context-appropriate correction even when they hold the right knowledge (see "Why do language models avoid correcting false user claims?"). Irony detection is social calibration; if the objective never rewards reading the social situation, the model learns the pattern but not the judgment of when it applies.


Sources (7 notes)

Do language models overestimate how often irony appears?

GPT-4o assigns significantly higher irony scores than humans (p < .001), revealing that LLMs detect irony as a pattern but miscalibrate its prevalence because ironic examples are more salient in training data than in actual use.

Can language models adapt implicature to conversational context?

ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can one model handle all types of figurative language?

The Diplomat dataset (4,177 dialogues) reframes metaphors, idioms, and puns as one pragmatic task: recovering literal meaning from non-literal expression. This framing suggests LLMs need better semantic decoupling ability, not more category-specific training data.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a pragmatics researcher testing whether LLMs can adapt irony detection to specific communicative contexts — a capability that remains contested. A curated library (2022–2026) found these dated claims, now ripe for re-examination:

**What a curated library found — and when:**
- GPT-4o systematically *overestimates* ironic intent, assigning higher irony scores than humans; the failure is miscalibration of prevalence, not blindness to the pattern (~2024–2025).
- ChatGPT showed *zero* context-sensitivity on scalar implicature: it ignored literal-mode instructions, information focus, and face-threatening cues that humans modulate flexibly (~2023).
- LMs struggle to hold multiple readings of ambiguous text simultaneously; GPT-4 disambiguates deliberately ambiguous sentences only 32% of the time versus 90% for humans (~2023).
- Strong parametric priors (training-time associations) override in-context information; prompting alone cannot reliably fix this override (~2024).
- Pragmatic reasoning may be unified across metaphor, idiom, and pun, and possibly by extension irony: a *semantic decoupling* problem rather than category-specific failures (~2023).

**Anchor papers (verify; mind their dates):**
- arXiv:2304.14399 (2023) — ambiguity recognition
- arXiv:2306.09030 (2023) — Diplomat dataset, unified pragmatic reasoning
- arXiv:2410.12405 (2024) — prompt sensitivity
- arXiv:2506.08952 (2025) — grounding and knowledge gaps

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For each claim above — overestimation bias, implicature blindness, two-reading deficit, prior override, unified pragmatics — determine whether *new* models (o1, Claude 3.5+), in-context learning methods (chain-of-thought, self-correction, retrieval-augmented reasoning), orchestration patterns (multi-agent debate, memory-grounded context windows), or fine-tuning / RLHF updates have *relaxed or overturned* the limitation. Separate the durable question (LMs adapting pragmatics to social context) from the perishable technical constraint (e.g., "models can't hold two readings"). Cite what relaxed it; name papers where the constraint still holds.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has recent work on in-context calibration, constitutional AI, or pragmatic fine-tuning directly addressed irony-context binding?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If newer models *can* hold multiple readings, does context-adaptive irony detection emerge without explicit training?" or "Does social-stakes representation in training (e.g., RLHF on pragmatic grounding) unlock irony calibration?"

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
