INQUIRING LINE

Why do LLMs perform better on explicit discourse connectives than implicit relations?

This explores why LLMs handle discourse relations marked by words like 'because' or 'however' but stumble when the same logical relationship is left unsaid — and what that gap reveals about how these models actually process meaning.


The short answer the corpus keeps circling back to: LLMs read the signpost, not the road. When a connective is present, the model has a surface token to latch onto; when the relationship has to be inferred from the meaning of two clauses, performance falls off a cliff. One study finds ChatGPT does fine on explicit relations but drops to 24.54% accuracy on implicit ones, which suggests that discourse 'competence' here is largely pattern-matching on visible cues rather than structural understanding of how ideas connect (Why does ChatGPT fail at implicit discourse relations?).
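To make the explicit/implicit comparison concrete, here is a minimal sketch of how such a test could be set up: remove the connective from an explicitly marked sentence to get its implicit counterpart, then score predictions separately per condition. The connective list, the function names, and the record format are all assumptions made for illustration; they are not taken from the study cited above, which relies on annotated discourse data rather than word deletion.

```python
import re

# Hypothetical connective list, for illustration only; real evaluations
# rely on annotated corpora rather than a fixed word list.
CONNECTIVES = ("because", "however", "therefore", "although")
_PATTERN = re.compile(
    r",?\s*\b(?:" + "|".join(CONNECTIVES) + r")\b,?\s*", re.IGNORECASE
)


def strip_connective(sentence: str) -> str:
    """Turn an explicitly marked sentence into an implicit two-sentence pair."""
    parts = [p.strip() for p in _PATTERN.split(sentence, maxsplit=1)]
    # Require a clause on both sides of the first connective found.
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected a connective between two clauses")
    left, right = parts
    return f"{left.rstrip('.,;')}. {right[0].upper()}{right[1:]}"


def accuracy_by_condition(records):
    """Score (condition, gold_label, predicted_label) triples per condition."""
    totals, hits = {}, {}
    for condition, gold, pred in records:
        totals[condition] = totals.get(condition, 0) + 1
        hits[condition] = hits.get(condition, 0) + (gold == pred)
    return {c: hits[c] / totals[c] for c in totals}
```

Scoring the two conditions separately is the point of the design: a single pooled accuracy would hide exactly the gap this line of inquiry is about.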

The same shape shows up when you slice discourse a different way. LLMs are noticeably better at causal reasoning than temporal reasoning, and the same mechanism explains it: causal links tend to come with explicit, frequent connectives in training text, while temporal order is usually left implicit and must be reconstructed from context (Why do LLMs handle causal reasoning better than temporal reasoning?). So 'explicit vs. implicit' isn't a quirk of one task; it's a fault line that runs through the model's whole relationship with language. Where the signal is on the surface, the model thrives; where it has to be computed, it guesses.

This isn't only a discourse problem; the same failure recurs in different guises across the corpus. LLMs treat presupposition triggers and non-factive verbs as surface cues instead of computing their actual effect on what's entailed (Why do embedding contexts confuse LLM entailment predictions?), and they'll accept false presuppositions even when they demonstrably know the correct fact (Why do language models accept false assumptions they know are wrong?). Grammatical competence degrades predictably as sentences get more structurally complex: embedded clauses and recursion break the model where simple sentences don't (Does LLM grammatical performance decline with structural complexity?). In every case the diagnosis is identical: statistical learning captures the visible marker but not the underlying structure it points to.

The deeper framing is that LLMs reason through semantic association, not symbolic manipulation. When you strip the familiar semantic content out of a reasoning task and leave only the logical rules, performance collapses, even with the correct rules supplied in context: the model was never running the inference, it was riding the token associations (Do large language models reason symbolically or semantically?). An explicit connective is exactly the kind of high-frequency association the model has memorized; an implicit relation demands the symbolic inference it doesn't actually do. One synthesis maps these breakdowns specifically to discourse intentionality and attention layers, suggesting the gap isn't just about surface vocabulary but about how the architecture allocates attention across a passage (Where exactly do language models fail at structural language tasks?).
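One simple way to picture "stripping the semantic content" is to replace each content word with an arbitrary symbol, consistently, so the logical form survives but the familiar associations do not. The sketch below is only an illustration of that idea; the function name, the `X1, X2, ...` symbol scheme, and the hand-picked word list are assumptions, not the procedure used in the cited work.

```python
import re


def decouple_semantics(text, content_words):
    """Replace each listed content word with an opaque symbol (X1, X2, ...).

    The same word always maps to the same symbol, so the logical
    structure of the text is preserved while its meaning is hidden.
    """
    mapping = {word: f"X{i}" for i, word in enumerate(content_words, start=1)}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in mapping) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


rule = "If birds fly and penguins are birds, then penguins fly."
symbolic, mapping = decouple_semantics(rule, ["birds", "penguins", "fly"])
print(symbolic)  # If X1 X3 and X2 are X1, then X2 X3.
```

A model that was actually applying the rule should do equally well on both versions; a drop on the symbolic one points to reliance on the word associations.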

Here's the thing you might not have expected to find: the limitation may be more about training signal than raw capacity. With explicit step-by-step reasoning, OpenAI's o1 can construct syntactic trees and metalinguistic analyses that models fail at in normal use (Can language models actually analyze language structure?), and a related conversational gap, engaging with distractors, narrows after fine-tuning on just 1,080 synthetic dialogues (Why do language models engage with conversational distractors?). So the explicit/implicit asymmetry isn't necessarily a hard ceiling. It may be that implicit relations are simply underrepresented as a learnable signal: the model never got enough reason to compute what it could instead just read off the surface.


Sources (9 notes)

Why does ChatGPT fail at implicit discourse relations?

ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a discourse linguist stress-testing claims about LLM language understanding. The question: Why do LLMs perform better on explicit discourse connectives than implicit relations—and is this a hard architectural limit or a training/inference artifact that newer methods have since relaxed?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–present. A library distilled these patterns:
• ChatGPT drops to 24.54% accuracy on implicit discourse relations vs. strong performance on explicit ones; the gap reflects pattern-matching on surface tokens, not structural understanding (2023–24).
• Causal reasoning outperforms temporal reasoning by the same mechanism: causal connectives are frequent and explicit in training text, temporal order is implicit and must be reconstructed (~2025).
• LLMs treat presupposition triggers, non-factive verbs, and embedded clauses as surface cues rather than computing their actual logical effect; grammatical competence degrades predictably with structural complexity (2023–24).
• Chain-of-thought prompting and minimal fine-tuning (~1K examples) can close related inference gaps, suggesting the limitation may reflect *underrepresented signal* in pretraining rather than incapacity (2023–24).
• Multi-turn conversation and topic-following remain fragile even in SOTA models, hinting the implicit/explicit asymmetry persists across discourse scales (~2025).

Anchor papers (verify; mind their dates):
• 2305.14825 (May 2023): LLMs are in-context semantic reasoners, not symbolic reasoners.
• 2503.19260 (Mar 2025): Linguistic Blind Spots of Large Language Models.
• 2502.10215 (Feb 2025): Do Large Language Models Reason Causally Like Us? Even Better?
• 2507.08017 (Jul 2025): Mechanistic Indicators of Understanding in Large Language Models.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the explicit/implicit gap, separate the *durable question* (do LLMs struggle with implicit relations?) from *perishable claims* (that this reflects fixed architectural blindness vs. recoverable training signal). Judge whether post-2025 work on retrieval-augmented generation, structured prompting, mechanistic probing, or parameter-efficient fine-tuning has genuinely relaxed the gap. Cite what relaxed it; note plainly where the constraint still holds.
(2) Surface the strongest work from the last 6 months that either *contradicts* the "semantic association, not symbolic inference" diagnosis or *supersedes* it with a finer model of how LLMs handle discourse. Flag disagreement within the library itself (e.g., if causal reasoning papers contradict 2305.14825).
(3) Propose 2 research questions that *assume* the regime may have moved: e.g., "If CoT or retrieval can close the gap, what minimum training signal is needed to make implicit relations learnable?" or "Do emergent discourse capabilities appear at specific scale/architecture milestones?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
