INQUIRING LINE

Why do explicit discourse connectives work when implicit relations fail?

This explores why LLMs handle discourse relations well when a linking word like 'because' or 'but' is on the page, yet fall apart when the same relationship is left unstated — and what that gap reveals about how these models actually process meaning.


The short version from the corpus: explicit connectives are *surface signals the model can pattern-match*, while implicit relations require actually reasoning about meaning, and that reasoning is the thing these models mostly don't do. ChatGPT performs strongly on explicit discourse relations but drops to 24.54% accuracy on implicit ones, a striking tell that its competence rides on the connective itself, not on an understanding of the semantic content underneath (see "Why does ChatGPT fail at implicit discourse relations?").
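One way to make the asymmetry concrete is to score the same classifier separately on items that carry a connective and items that do not. The sketch below is a minimal illustration, not the evaluation used in the cited work: the `accuracy_by_condition` helper, the condition names, and the toy predictions are all hypothetical and exist only to exercise the function.

```python
from collections import defaultdict


def accuracy_by_condition(items):
    """Return accuracy per condition.

    items: iterable of (condition, gold_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, gold, predicted in items:
        total[condition] += 1
        correct[condition] += int(gold == predicted)
    return {condition: correct[condition] / total[condition] for condition in total}


# Made-up predictions, purely illustrative (not real model output).
toy_items = [
    ("explicit", "cause", "cause"),
    ("explicit", "contrast", "contrast"),
    ("implicit", "cause", "contrast"),
    ("implicit", "contrast", "contrast"),
]

scores = accuracy_by_condition(toy_items)
gap = scores["explicit"] - scores["implicit"]
print(scores, gap)
```

Reporting the two conditions side by side, rather than one pooled accuracy, is what exposes the gap; a pooled number would hide it whenever explicit items dominate the test set.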

The same asymmetry shows up wherever the cue is explicit versus inferred. Causal reasoning beats temporal reasoning in LLMs for exactly this reason: causal connectives ("because," "therefore") are frequent and explicit in training text, while temporal order is usually left implicit and has to be reconstructed from context (see "Why do LLMs handle causal reasoning better than temporal reasoning?"). Zoom out and it's a general pattern: models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and anything requiring forward planning across a discourse (see "Where exactly do language models fail at structural language tasks?"). The connective isn't just a hint; it's load-bearing scaffolding the model leans on instead of building its own.
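If the connective really is load-bearing, one simple probe is to delete it and ask the model to label the relation between the two remaining arguments. The following is a rough sketch of that ablation step only, under loud assumptions: the `CONNECTIVES` list is a tiny hypothetical inventory, the `ablate_connective` helper is invented for illustration, and splitting at the first matching word ignores cases where the word is not functioning as a discourse connective.

```python
import re

# Hypothetical, deliberately tiny inventory; a real study would use an annotated list.
CONNECTIVES = ("because", "but", "therefore", "however", "so")

# Optional leading comma/semicolon, the connective as a whole word, optional trailing comma.
_PATTERN = re.compile(
    r"[,;]?\s*\b(" + "|".join(CONNECTIVES) + r")\b,?\s*",
    re.IGNORECASE,
)


def ablate_connective(sentence):
    """Split a sentence at its first listed connective.

    Returns (arg1, arg2, connective), or None if no connective is found
    or if either side of the split is empty.
    """
    match = _PATTERN.search(sentence)
    if match is None:
        return None
    arg1 = sentence[:match.start()].strip()
    arg2 = sentence[match.end():].strip()
    if not arg1 or not arg2:
        return None
    return arg1, arg2, match.group(1).lower()


print(ablate_connective("The match was cancelled because it rained."))
print(ablate_connective("She studied hard, but she failed."))
```

Each returned pair gives an explicit item (the original sentence) and a matched implicit item (the two arguments without the connective), so any drop in accuracy between them can be attributed to the missing cue rather than to different content.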

What makes this more than a quirk is that the failure isn't ignorance; it's a failure to compute structure that is present on the page. Models treat presupposition triggers and non-factive verbs as surface cues rather than working out their actual semantic effect on entailment, so embedding contexts become systematic blind spots (see "Why do embedding contexts confuse LLM entailment predictions?"). They'll accommodate a false presupposition even when a direct question proves they know the correct fact (see "Why do language models accept false assumptions they know are wrong?"), and they fail to adjust scalar implicatures to conversational context the way humans reflexively do (see "Can language models adapt implicature to conversational context?"). In each case the knowledge is there; what's missing is the structural inference step that an explicit marker would otherwise spare them from taking.

This connects to a deeper claim worth pulling forward: chain-of-thought reasoning shows the same signature. CoT works by constraining the model to reproduce familiar reasoning *forms* from training rather than performing novel inference, and it degrades under distribution shift, which is the fingerprint of imitation, not capability (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). Explicit connectives are essentially the discourse-level version of that crutch: a learned form the model can echo. Tasks that demand integrating an inferential pattern across distributed spans, argument scheme classification for instance, plateau far below tasks with local surface features (see "Why does argument scheme classification stumble where other NLP tasks succeed?").

The thing you might not have expected to learn: this whole pattern is arguably what language models *are*, not a bug to be patched. One line of thinking holds that LLMs operationalize Saussure's *langue*: they compress the relational structure of text without any external referent or grounding (see "Can language models learn meaning without engaging the world?"). An explicit connective lives inside that relational system; an implicit relation points outside it, to inference about a world the model never touches. Seen that way, the explicit/implicit gap isn't a quirk of one benchmark; it's the visible seam between pattern-completion and the structural understanding these systems were never built to have.


Sources (9 notes)

Why does ChatGPT fail at implicit discourse relations?

ChatGPT performs well on explicit discourse relations with connectives but achieves only 24.54% accuracy on implicit relations without them. This asymmetry reveals that LLMs rely on surface signals rather than inferring meaning from semantic content.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can language models adapt implicature to conversational context?

ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a discourse semantics researcher probing why LLMs excel with explicit connectives but fail on implicit relations. The question remains: is this a trainable gap or a fundamental architectural ceiling?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as perishable constraints to be re-tested:
• ChatGPT achieves ~24% accuracy on implicit discourse relations vs. strong performance on explicit connectives (2023–2024); explicit markers act as surface scaffolds, not genuine semantic understanding.
• CoT reasoning is constrained imitation of familiar reasoning *forms* from training, not abstract inference; it degrades under distribution shift (~2025, arXiv:2506.02878).
• Models systematically fail to compute presupposition, scalar implicature, and non-factive verb effects in embedded contexts, even when the underlying knowledge is present (2023–2024).
• Implicit causal and temporal reasoning both plateau on tasks requiring integration across distributed spans; argument scheme classification shows this pattern most starkly (2024).
• A 2026 study flags surface heuristics as actively overriding implicit constraints in reasoning (arXiv:2603.29025).

Anchor papers (verify; mind their dates):
• arXiv:2305.14785 (2023-05): Blind spots in simple linguistic inference.
• arXiv:2401.05777 (2024-01): Probing structured semantics understanding.
• arXiv:2506.02878 (2025-06): CoT as constrained imitation.
• arXiv:2603.29025 (2026-03): Surface heuristics override implicit constraints.

Your task:
(1) RE-TEST THE EXPLICIT/IMPLICIT ASYMMETRY. For each constraint above, determine whether newer models (o1, o3, Claude 3.5+), multi-turn scaffolding (iterative questioning, memory), mechanistic probing, or fine-tuning have narrowed or dissolved the gap. Separate the durable question (do LLMs genuinely infer implicit relations?) from the perishable limitation (maybe they can, given new training or orchestration). Ground any advances in concrete benchmarks and arXiv IDs.
(2) Surface the strongest DISAGREEMENT. The library hints that this gap may be *architectural*, not empirical. Find papers from the last 6 months arguing LLMs *do* reason over implicit structure, or that the gap is an artifact of evaluation, not capability.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can explicit connectives be *learned* via mechanistic intervention rather than training? (b) Do multi-agent or iterative orchestration effectively *synthesize* implicit relations that single-pass inference misses?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
