Why does argument scheme classification stumble where other NLP tasks succeed?
Explores whether the abstract, relational nature of argument schemes makes them harder to classify than concrete argument components or stance. This matters because understanding the difficulty gap could improve scheme recognition systems.
Argument-mining NLP tasks divide along a hidden axis of difficulty. Identifying argument components (claim, premise, warrant) is a span-tagging task — the unit is a piece of text, and the cues are positional and lexical. Identifying stance is a sentence-level classification task — the cues are sentiment and polarity. Identifying argument schemes in Walton's taxonomy is categorically harder because the unit of recognition is not a piece of text but a pattern of reasoning linking premises to a conclusion through a specific inferential move.
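The contrast between the three recognition units can be sketched as data structures. This is a minimal illustration, not a format from any particular corpus: the class names, fields, and the `evidence_units` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ComponentTag:
    """Span tagging: the label attaches to one character span."""
    span: Tuple[int, int]   # (start, end) offsets into the source text
    label: str              # e.g. "claim", "premise", "warrant"


@dataclass
class StanceExample:
    """Sentence-level classification: the label attaches to one sentence."""
    sentence: str
    label: str              # e.g. "pro", "con"


@dataclass
class SchemeExample:
    """Scheme classification: the label attaches to a premises-to-conclusion link."""
    premises: List[str]
    conclusion: str
    scheme: str             # e.g. "argument from expert opinion"


def evidence_units(example: Union[ComponentTag, StanceExample, SchemeExample]) -> int:
    """Count how many text segments must be read together to assign the label."""
    if isinstance(example, SchemeExample):
        return len(example.premises) + 1  # every premise plus the conclusion
    return 1  # span and sentence tasks are decided locally
```

Under these assumptions, a scheme example with two premises needs three segments integrated before any label can be assigned, while the other two tasks need one.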
The empirical signature of this difficulty is a plateau around F1 0.55–0.65 across both pretrained language models and modern LLMs. BERT achieves F1 0.53; the strongest large model reaches 0.65 in the most favorable configuration. The same models that classify stance and tag argument components at well above 0.80 stall on schemes. This is not a scaling issue alone: it is evidence that scheme recognition requires integrating multiple text spans (premises and conclusion) and reasoning about the inferential bridge between them.
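For readers who want to see how an F1 figure of this kind is computed for a multi-class task, here is a minimal sketch. It assumes macro averaging (the unweighted mean of per-class F1), which the note does not specify; the function name `macro_f1` and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two scheme labels (invented data)
gold = ["analogy", "analogy", "expert", "expert"]
pred = ["analogy", "expert", "expert", "expert"]
score = macro_f1(gold, pred)
```

Because macro averaging weights every class equally, rare schemes with poor recall pull the score down as much as frequent ones, which is one reason large scheme inventories are hard to score well on.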
The cognitive-load framing predicts further failure modes. Tasks where the recognition target is a relation among text segments (rather than a property of a single segment) should consistently underperform tasks where recognition is local. Argument scheme classification is one instance; others include rhetorical relation classification in RST, discourse coherence relations, and counterfactual implication. The shared structure is that the evidence for the label is distributed across the input and requires integration.
The practical implication is that argument scheme labels are not yet a reliable feature for downstream pipelines. Systems that need scheme-aware behavior (dialectical evaluation, legal reasoning, value alignment dialogues) should either restrict themselves to the smaller set of schemes with the strongest classification performance, or use each scheme's critical questions as a probing structure rather than relying on classification.
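Both workarounds can be sketched in a few lines. This is an illustration under stated assumptions, not a method from the source: the per-scheme F1 values are invented, the 0.75 threshold is arbitrary, all function names are hypothetical, and the critical questions are paraphrased from Walton's argument from expert opinion.

```python
from typing import Dict, List, Optional, Set

# Invented per-scheme validation scores, for illustration only.
PER_SCHEME_F1: Dict[str, float] = {
    "expert_opinion": 0.82,
    "cause_to_effect": 0.61,
    "analogy": 0.48,
}

# Paraphrased critical questions for the argument from expert opinion.
CRITICAL_QUESTIONS: Dict[str, List[str]] = {
    "expert_opinion": [
        "How credible is the source as an expert?",
        "Is the source an expert in the field the claim belongs to?",
        "What exactly did the source assert that implies the claim?",
        "Is the source personally reliable?",
        "Is the claim consistent with what other experts assert?",
        "Is the source's assertion based on evidence?",
    ],
}


def reliable_schemes(per_scheme_f1: Dict[str, float], min_f1: float) -> Set[str]:
    """Option 1: keep only schemes the classifier handles well enough."""
    return {name for name, f1 in per_scheme_f1.items() if f1 >= min_f1}


def gate_label(predicted: str, allowed: Set[str]) -> Optional[str]:
    """Pass a predicted scheme downstream only if it is in the trusted set."""
    return predicted if predicted in allowed else None


def probing_questions(scheme: str) -> List[str]:
    """Option 2: use a scheme's critical questions as prompts, not as a label."""
    return CRITICAL_QUESTIONS.get(scheme, [])


allowed = reliable_schemes(PER_SCHEME_F1, min_f1=0.75)
```

Returning `None` from `gate_label` lets a downstream component fall back to scheme-agnostic behavior instead of acting on a label the classifier is likely to get wrong.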
Inquiring lines that use this note as a source 28
This note is a source for these synthesized inquiries.
- Can you separate grammatical competence from rhetorical commitment in language systems?
- How does evaluative stance differ from structural argument analysis?
- What is the difference between learning discourse patterns and learning abstract language?
- How deeply are ideological structures represented in large language models?
- What makes intent taxonomies unmanageable at hundreds of intents?
- Why do discourse failures cluster in attention and intentional layers rather than linguistics?
- Why do explicit discourse connectives work when implicit relations fail?
- Why does standard RAG succeed for evidence-based but fail for debate questions?
- What distinguishes contrasting aspects from related aspects in question structure?
- How do comparison and debate questions differ in their aspect retrieval needs?
- Can the eight-dimension rubric predict which question types need decomposition?
- What role does discourse structure play in determining at-issueness?
- Can hierarchical key point structures improve opinion summarization?
- How do the six trap categories map onto detection difficulty?
- Why do smaller LLMs fail at zero-shot argument scheme classification?
- Why does scheme classification require more cognitive load than identifying premises?
- Do scheme critical questions work better than direct scheme classification prompts?
- What are the three orthogonal axes that structure the argument scheme periodic table?
- How does the first-order and second-order distinction unify classical and modern argument theory?
- Why do LLM descriptions of argument schemes work better than formal definitions for classification?
- Can smaller scheme inventories or critical questions replace direct scheme classification?
- What failure modes emerge when scheme classification feeds downstream reasoning pipelines?
- What are the nine possible proposition-type combinations in arguments?
- Can argumentation structure improve reasoning through decomposition alone?
- Does argument-scheme prompting improve reasoning in non-code domains the same way?
- What makes an argument fallacious according to formal linguistic criteria?
- Can formal argumentation structure replace ad-hoc fallacy classifications?
- Do computational systems need formal argument analysis for explainability?
Related concepts in this collection 3
- Can large language models classify argument schemes reliably?
  Explores whether LLMs can recognize Walton's 60+ argument schemes (abstract patterns of reasoning rather than surface features) and what conditions enable accurate classification.
  Relation: same paper, the empirical evaluation
- Why do reasoning models struggle with theory of mind tasks?
  Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
  Relation: analogous: integrative reasoning tasks behave differently from local-pattern tasks
- Can structured argument prompts make LLM reasoning more rigorous?
  Does requiring language models to explicitly check warrants, backing, and rebuttals, rather than reasoning freely, improve reasoning quality and catch failures that standard step-by-step prompting misses?
  Relation: the workaround: use scheme structure to drive reasoning rather than as a classification target
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can Large Language Models Understand Argument Schemes?
- A Robustness Evaluation Framework for Argument Mining
- Can Language Models Recognize Convincing Arguments?
- Exploring the Potential of Large Language Models in Computational Argumentation
- Argument Quality Assessment in the Age of Instruction-Following Large Language Models
- Constructing a Periodic Table of Arguments
- Computational Modelling of Undercuts in Real-world Arguments
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
Original note title
argument scheme classification carries higher cognitive load than other argument NLP tasks because schemes are abstract presumptive patterns not surface features