INQUIRING LINE

Why can LLMs identify argument structure but not check warrants?

This explores why LLMs reliably spot the parts of an argument (claim, evidence, structure) yet stumble when they have to check the unstated assumption — the warrant — that actually licenses the leap from evidence to claim.


The corpus points to a clean split: identifying structure is a surface-pattern task, while checking a warrant requires bringing the right piece of world knowledge forward as a constraint at the right moment, and that second move is where these models systematically break. One study finds exactly this gap: models correctly label claims and evidence but fail at supplying or evaluating the implicit warrants connecting them, and the failure persists even when the surface structure is read correctly (Can LLMs identify the hidden assumptions that make arguments work?). The telling detail is that the knowledge is not absent; the model simply does not access it in the argumentative moment.

That "knows it but doesn't use it" pattern shows up everywhere once you look. Models accommodate false presuppositions even when direct questioning proves they hold the correct fact (Why do language models accept false assumptions they know are wrong?), and performance roughly halves on questions built on false assumptions (Why do language models struggle with questions containing false assumptions?). The deepest version of this is the "frame problem": models fail not from missing world knowledge but from failing to enumerate the relevant background conditions, and forcing them to list those preconditions explicitly raises accuracy from 30% to 85% (Do language models fail at identifying unstated preconditions?). A warrant is precisely an unstated background condition. So checking it requires the one operation LLMs don't do spontaneously: surfacing the implicit and testing it.
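The enumeration step described above can be sketched as a small prompt builder. This is a minimal illustration, not the prompt used in the cited study: the names `build_enumeration_prompt` and `check_warrant`, the wording of the three steps, and the `ask` callable (a stand-in for whatever model client is in use) are all assumptions made for the example.

```python
from typing import Callable


def build_enumeration_prompt(evidence: str, claim: str) -> str:
    """Build a prompt that makes precondition listing a required step.

    The step wording is hypothetical; adapt it to your own task.
    """
    return (
        f"Evidence: {evidence}\n"
        f"Claim: {claim}\n\n"
        "Step 1. List every unstated background condition that must hold "
        "for the evidence to support the claim.\n"
        "Step 2. For each condition, say whether it is true, false, or unknown.\n"
        "Step 3. On the final line, answer SUPPORTED or NOT SUPPORTED."
    )


def check_warrant(evidence: str, claim: str, ask: Callable[[str], str]) -> bool:
    """Send the prompt through `ask` and read the verdict from the last line.

    `ask` is an assumed interface: it takes a prompt and returns the reply text.
    An empty or unparseable reply is treated as "not supported".
    """
    reply = ask(build_enumeration_prompt(evidence, claim))
    lines = reply.strip().splitlines()
    if not lines:
        return False
    verdict = lines[-1].upper()
    return "SUPPORTED" in verdict and "NOT SUPPORTED" not in verdict
```

Passing a stub such as `lambda prompt: "1. It rained recently.\nNOT SUPPORTED"` as `ask` is enough to exercise the parsing without calling a real model. The point of the design is only that the verdict is requested after the preconditions have been written out, not before.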

Why don't they do it spontaneously? Because the underlying machinery is semantic association, not symbolic verification. When reasoning is decoupled from familiar semantic content, performance collapses even with the correct rules sitting in context (Do large language models reason symbolically or semantically?). Relatedly, models predict entailment based on whether a hypothesis looks attested in training rather than whether the premise actually supports it (Do LLMs predict entailment based on what they memorized?), and they treat presupposition triggers as surface cues instead of computing their real semantic effect (Why do embedding contexts confuse LLM entailment predictions?). Warrant-checking asks about a support relation (does this actually follow?), which is exactly the symbolic operation that attestation and surface-cue matching short-circuit.

The sharpest framing of the whole phenomenon is "Potemkin understanding": a model can explain a concept correctly, fail to apply it, and even recognize its own failure, a combination that suggests explanation and execution run on functionally disconnected pathways (Can LLMs understand concepts they cannot apply?). Identifying argument structure draws on the explanation pathway (you can describe a warrant in the abstract); checking a specific warrant against the world draws on the execution pathway. The structure-vs-warrant gap is that disconnection viewed through argumentation.

The genuinely useful twist is that this gap is partly addressable from the prompt side. Turning Toulmin's model into explicit steps, so that the model must name warrants and backing rather than skip implicit premises, catches failures that ordinary chain-of-thought lets through (Can structured argument prompts make LLM reasoning more rigorous?). This is the same lever that fixed the precondition-enumeration failure. Even argument-scheme classification, the "structure" task itself, only works with few-shot examples and paraphrased descriptions rather than zero-shot or formal definitions (Can large language models classify argument schemes reliably?; Why do paraphrased definitions work better than expert ones?). So the real story isn't "can't" but "won't unless forced": warrant-checking is latent and recoverable, but only when the prompt makes the implicit step mandatory instead of optional.
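One way to make the implicit step mandatory is to ask for every Toulmin part by label and then check that none came back empty. The sketch below assumes the standard six Toulmin parts and a simple "Label: text" reply format; it is not the CQoT procedure from the cited note, and `build_toulmin_prompt` and `missing_parts` are hypothetical names introduced for illustration.

```python
from typing import List

# The six parts of Toulmin's argument model.
TOULMIN_PARTS = ("Claim", "Grounds", "Warrant", "Backing", "Qualifier", "Rebuttal")


def build_toulmin_prompt(argument: str) -> str:
    """Ask for each Toulmin part on its own labeled line (assumed format)."""
    lines = [
        "Analyze the argument below. Fill in every labeled part; "
        "write 'none stated' rather than skipping one.",
        "",
        f"Argument: {argument}",
        "",
    ]
    lines += [f"{part}:" for part in TOULMIN_PARTS]
    return "\n".join(lines)


def missing_parts(reply: str) -> List[str]:
    """Return the Toulmin parts that have no non-empty 'Label: text' line."""
    filled = set()
    for line in reply.splitlines():
        label, sep, text = line.partition(":")
        if sep and text.strip():
            filled.add(label.strip().lower())
    return [part for part in TOULMIN_PARTS if part.lower() not in filled]
```

A caller could re-prompt whenever `missing_parts` returns a non-empty list, so a skipped warrant becomes a detectable formatting failure instead of a silent omission. The check only confirms that a warrant was stated, not that it is true.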


Sources (11 notes)

Can LLMs identify the hidden assumptions that make arguments work?

LLMs successfully identify claims and evidence but significantly fail at supplying or evaluating the implicit warrants connecting them. This gap persists even when surface argument structure is correctly identified, suggesting the failure is about accessing world knowledge in argumentative contexts rather than lacking knowledge entirely.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models struggle with questions containing false assumptions?

The (QA)² benchmark found that zero-shot LLMs halve their performance when questions contain false or unverifiable assumptions compared to valid questions. Even top models reached only 56% acceptability, and the gap persists despite model scaling, suggesting false presuppositions embedded in plausible language are systematically difficult to reject.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Why do paraphrased definitions work better than expert ones?

LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Why can LLMs identify argument structure but not check warrants—and has this gap narrowed or persisted?

What a curated library found — and when (dated claims, not current truth): Findings span 2017–2026; treat these as perishable constraints to re-test.
• Models correctly label claims and evidence but fail to supply or evaluate implicit warrants connecting them, even when surface structure is recognized (~2017–2024).
• The failure is not missing knowledge; it's knowledge the model holds but doesn't access in the argumentative moment. Forcing explicit precondition enumeration lifts accuracy from ~30% to ~85% (~2023–2024).
• LLMs are semantic reasoners, not symbolic reasoners: they predict entailment via hypothesis attestation in training rather than via premise-support relations; presupposition triggers are surface cues, not semantic effects (~2023).
• Warrant-checking is recoverable via prompt design: explicit Toulmin steps and argument-scheme critical-questions as structured prompts catch failures that chain-of-thought misses (~2024–2025).
• Argument-scheme classification itself succeeds only with few-shot examples and paraphrased descriptions, not zero-shot or formal definitions (~2024).

Anchor papers (verify; mind their dates):
• arXiv:1708.01425 (2017) — The Argument Reasoning Comprehension Task
• arXiv:2305.14825 (2023) — In-Context Semantic vs. Symbolic Reasoners
• arXiv:2412.15177 (2024) — Critical-Questions-of-Thought: Argumentative Querying
• arXiv:2506.08952 (2025) — Grounding and Loaded Questions

Your task:
(1) RE-TEST THE "KNOWS BUT DOESN'T USE" CONSTRAINT: Does newer model scale, instruction-tuning, chain-of-thought variants (e.g., tree-of-thought, graph-of-thought), or tool use (formal logic verifiers, external warrant databases) now spontaneously surface and test implicit warrants? Separate the durable question—why symbolic verification remains hard—from the perishable limitation—that prompting can't unlock it. Cite what unlocked it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months: any evidence that warrant-checking is now solved, or that semantic reasoning suffices, or that the structure–execution disconnect no longer holds.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If prompting or fine-tuning now makes warrant-checking reliable, why, and what is the minimal intervention? (b) Does the "Potemkin understanding" frame still hold, or do newer models show converged explanation–execution pathways?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
