Why can LLMs identify argument structure but not check warrants?
This explores why LLMs reliably spot the parts of an argument (claim, evidence, structure) yet stumble when they have to check the unstated assumption — the warrant — that actually licenses the leap from evidence to claim.
The corpus points to a clean split: identifying structure is a surface-pattern task, while checking a warrant requires bringing the right piece of world knowledge forward as a constraint at the right moment, and that second move is where these models systematically break. One study finds exactly this gap: models correctly label claims and evidence but fail to supply or evaluate the implicit warrants connecting them, and the failure persists even when the surface structure is read correctly (see "Can LLMs identify the hidden assumptions that make arguments work?"). The telling detail is that the knowledge is not absent; the model simply does not access it in the argumentative moment.
That "knows it but doesn't use it" pattern shows up everywhere once you look. Models accommodate false presuppositions even when direct questioning proves they hold the correct fact (see "Why do language models accept false assumptions they know are wrong?"), and performance roughly halves on questions built on false assumptions (see "Why do language models struggle with questions containing false assumptions?"). The deepest version of this is the "frame problem": models fail not from missing world knowledge but from failing to enumerate the relevant background conditions, and forcing them to list those preconditions explicitly raises accuracy from 30% to 85% (see "Do language models fail at identifying unstated preconditions?"). A warrant is precisely an unstated background condition. So checking it requires the one operation LLMs don't do spontaneously: surfacing the implicit and testing it.
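The enumeration step described above can be sketched as a two-stage prompt builder. This is a minimal illustration, not the prompt used in the cited work: the function names `enumeration_prompt` and `answer_prompt`, the `max_conditions` parameter, and all prompt wording are assumptions made for this example.

```python
def enumeration_prompt(question: str, max_conditions: int = 5) -> str:
    """Stage 1: ask the model to list unstated preconditions before answering.

    Hypothetical helper; the wording is illustrative only.
    """
    if max_conditions < 1:
        raise ValueError("max_conditions must be at least 1")
    return (
        f"Question: {question}\n"
        f"Before answering, list up to {max_conditions} unstated conditions "
        "that must hold for the question to have a sensible answer. "
        "Write one condition per line and do not answer yet."
    )


def answer_prompt(question: str, conditions: list[str]) -> str:
    """Stage 2: feed the listed conditions back so each one must be checked."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(conditions, start=1))
    return (
        f"Question: {question}\n"
        f"Conditions to check:\n{numbered}\n"
        "For each condition, say whether it holds. If any condition fails, "
        "say so instead of answering. Otherwise, answer the question."
    )
```

The split into two calls is deliberate: the model's own list from stage 1 becomes mandatory input to stage 2, so the implicit conditions cannot be silently skipped.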
Why don't they do it spontaneously? Because the underlying machinery is semantic association, not symbolic verification. When reasoning is decoupled from familiar semantic content, performance collapses even with the correct rules sitting in context (see "Do large language models reason symbolically or semantically?"). Relatedly, models predict entailment based on whether a hypothesis looks attested in training rather than whether the premise actually supports it (see "Do LLMs predict entailment based on what they memorized?"), and they treat presupposition triggers as surface cues instead of computing their real semantic effect (see "Why do embedding contexts confuse LLM entailment predictions?"). Warrant-checking asks whether a support relation holds (does this actually follow?), which is exactly the symbolic operation that attestation and surface-cue matching short-circuit.
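The random-premise idea behind the attestation finding can be sketched as a small probe. This is an illustration under stated assumptions, not the cited study's code: the predictor is any callable returning True for "entails", and the toy data and both toy predictors (`premise_blind`, `premise_reader`) are invented stand-ins for a real model.

```python
import random
from typing import Callable, Sequence

Pair = tuple[str, str]  # (premise, hypothesis)
Predictor = Callable[[str, str], bool]  # True means "premise entails hypothesis"


def entailment_rate(pairs: Sequence[Pair], predict: Predictor) -> float:
    """Fraction of pairs the predictor labels as entailment."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(predict(p, h) for p, h in pairs) / len(pairs)


def with_random_premises(pairs: Sequence[Pair], seed: int = 0) -> list[Pair]:
    """Rotate premises by a random nonzero offset, so each hypothesis
    is paired with the premise from a different position."""
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to swap premises")
    offset = random.Random(seed).randrange(1, len(pairs))
    premises = [p for p, _ in pairs]
    rotated = premises[offset:] + premises[:offset]
    return [(new_p, h) for new_p, (_, h) in zip(rotated, pairs)]


def premise_sensitivity(pairs: Sequence[Pair], predict: Predictor, seed: int = 0) -> float:
    """Drop in entailment rate once premises are randomized.
    A value near zero suggests the predictor ignores the premise."""
    swapped = with_random_premises(pairs, seed)
    return entailment_rate(pairs, predict) - entailment_rate(swapped, predict)


# Toy data and toy predictors (hypothetical stand-ins for a real model).
PAIRS = [
    ("Rain fell all night in Oslo", "Rain fell"),
    ("The bridge closed for repairs", "The bridge closed"),
    ("Maria sold her car", "Maria sold"),
]
ATTESTED = {"Rain fell", "The bridge closed"}  # pretend these were memorized


def premise_blind(premise: str, hypothesis: str) -> bool:
    return hypothesis in ATTESTED  # never looks at the premise


def premise_reader(premise: str, hypothesis: str) -> bool:
    return hypothesis in premise  # crude check that actually uses the premise
```

A predictor that answers from memorized hypotheses keeps the same entailment rate after the swap, while one that reads the premise loses it. That contrast is the signature the attestation result describes.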
The sharpest framing of the whole phenomenon is "Potemkin understanding": a model can explain a concept correctly, fail to apply it, and even recognize its own failure. That combination suggests explanation and execution run on functionally disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). Identifying argument structure draws on the explanation pathway (you can describe a warrant in the abstract); checking a specific warrant against the world draws on the execution pathway. The structure-vs-warrant gap is that disconnection viewed through argumentation.
The genuinely useful twist is that this gap is partly addressable from the prompt side. Turning Toulmin's model into explicit steps, so that the model must name warrants and backing rather than skip implicit premises, catches failures that ordinary chain-of-thought lets through (see "Can structured argument prompts make LLM reasoning more rigorous?"). This is the same lever that fixed the precondition-enumeration failure. Even argument-scheme classification, the "structure" task itself, only works with few-shot examples and paraphrased descriptions rather than zero-shot prompting or formal definitions (see "Can large language models classify argument schemes reliably?" and "Why do paraphrased definitions work better than expert ones?"). So the real story isn't "can't" but "won't unless forced": warrant-checking is latent and recoverable, but only when the prompt makes the implicit step mandatory instead of optional.
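One way to make the implicit step mandatory is to spell out Toulmin's components as numbered prompt steps. The sketch below is an assumption-laden illustration, not the CQoT prompt from the cited note: `TOULMIN_STEPS`, `toulmin_prompt`, and every instruction string are hypothetical wording chosen for this example.

```python
# Toulmin's six components, each paired with an illustrative instruction.
TOULMIN_STEPS = [
    ("claim", "State the conclusion the argument wants accepted."),
    ("grounds", "List the evidence offered in support."),
    ("warrant", "Write out the unstated assumption linking evidence to conclusion."),
    ("backing", "Give the reasons, if any, for trusting that assumption."),
    ("qualifier", "Say how strongly the conclusion follows (always, usually, sometimes)."),
    ("rebuttal", "Name conditions under which the assumption would fail."),
]


def toulmin_prompt(argument: str) -> str:
    """Build a prompt that forces every Toulmin step to be filled in, in order."""
    text = argument.strip()
    if not text:
        raise ValueError("argument must not be empty")
    lines = [
        f"Argument: {text}",
        "",
        "Work through every step in order. Do not skip a step.",
    ]
    for number, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{number}. {name.upper()}: {instruction}")
    lines.append(
        "Verdict: decide whether the assumption from step 3 holds in the real "
        "world, and accept or reject the argument on that basis."
    )
    return "\n".join(lines)
```

Keeping the steps in a plain list makes the order explicit and easy to audit; the verdict line points back to step 3 so the final judgment rests on the stated warrant rather than on the surface shape of the argument.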
Sources (11 notes)
LLMs successfully identify claims and evidence but significantly fail at supplying or evaluating the implicit warrants connecting them. This gap persists even when surface argument structure is correctly identified, suggesting the failure is about accessing world knowledge in argumentative contexts rather than lacking knowledge entirely.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
The (QA)2 benchmark found that zero-shot LLMs halve their performance when questions contain false or unverifiable assumptions compared to valid questions. Even top models reached only 56% acceptability, and the gap persists despite model scaling, suggesting false presuppositions embedded in plausible language are systematically difficult to reject.
LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.
LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.
LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.