INQUIRING LINE


Reducing argument categories from dozens to nine barely helps AI classify them — how the categories are described matters far more.

Does compressing Walton's schemes into nine categories make LLM classification easier?

This explores whether collapsing Walton's many fine-grained argument schemes into a smaller set of nine categories actually helps LLMs classify them — and the corpus suggests the bottleneck isn't the number of categories so much as how the categories are described and how LLMs represent meaning.

The corpus doesn't run that exact experiment, but several of its findings point in one direction: fewer categories may help a little, but the real levers are elsewhere. The most direct evidence is that LLMs classify argument schemes acceptably only under narrow conditions, namely few-shot examples plus scheme descriptions. Even then, only larger models exceed F1 0.55, with Claude reaching about 0.65, while smaller models plateau near 0.53 (Can large language models classify argument schemes reliably?). That plateau looks like a representational ceiling, not a category-count problem, which hints that compressing the label set alone won't unlock much.
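To make the F1 figures above concrete, here is a minimal sketch of how a classification run over scheme labels could be scored. It assumes macro-averaged F1 (the notes do not say which averaging was used), and the label names and toy predictions are hypothetical, chosen only for illustration.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Macro-averaged F1: compute F1 per label, then take the unweighted mean."""
    labels = sorted(set(gold) | set(predicted))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label was wrong
            fn[g] += 1  # the gold label was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy run: four arguments, three scheme labels.
gold = ["expert_opinion", "analogy", "analogy", "cause_to_effect"]
pred = ["expert_opinion", "analogy", "cause_to_effect", "cause_to_effect"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.3f}")
```

Because macro averaging weights every label equally, a model that handles a few frequent schemes well but misses rare ones can still score low, which is one reason a plateau near 0.55 need not mean the model is wrong on most inputs.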

What *does* move the needle is surprising: LLM-generated paraphrases of the schemes outperform Walton's own formal definitions (Why do paraphrased definitions work better than expert ones?). The reason is that paraphrases sit closer to the model's training distribution than formal logical vocabulary does. So if you're going to compress to nine categories, the win comes less from *how few* the categories are and more from *how the categories are worded*: describe each in the model's native idiom rather than in expert logic-speak.
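The contrast between formal and paraphrased descriptions can be sketched as a prompt builder where only the description set changes. This is an illustrative sketch, not the setup used in the underlying study: the scheme names are two of Walton's, but the description wording, the few-shot examples, and the `build_prompt` helper are all hypothetical.

```python
# Illustrative wording only: these descriptions are placeholders written for
# this sketch, not quotations from Walton or from the notes.
FORMAL = {
    "argument from expert opinion": (
        "Source E is an expert in domain D; E asserts proposition A; "
        "therefore A may plausibly be taken to be true."
    ),
    "argument from analogy": (
        "Case C1 is similar to case C2; A holds in C1; therefore A holds in C2."
    ),
}
PARAPHRASED = {
    "argument from expert opinion": (
        "Someone who knows the field well says it is so, so it is probably so."
    ),
    "argument from analogy": (
        "Two situations are alike, so what is true of one is probably true of the other."
    ),
}
# Hypothetical few-shot examples.
FEW_SHOT = [
    ("My cardiologist says this drug is safe, so it is safe.",
     "argument from expert opinion"),
    ("Banning books is like burning them, and burning them is wrong.",
     "argument from analogy"),
]


def build_prompt(descriptions, examples, argument):
    """Assemble a few-shot prompt from scheme descriptions and labelled examples."""
    lines = ["Classify the argument into exactly one of these schemes.", "", "Schemes:"]
    for name, text in descriptions.items():
        lines.append(f"- {name}: {text}")
    lines += ["", "Examples:"]
    for text, label in examples:
        lines.append(f"Argument: {text}")
        lines.append(f"Scheme: {label}")
    lines += ["", f"Argument: {argument}", "Scheme:"]
    return "\n".join(lines)


target = "A leading climatologist says sea levels are rising, so they are."
prompt_formal = build_prompt(FORMAL, FEW_SHOT, target)
prompt_plain = build_prompt(PARAPHRASED, FEW_SHOT, target)
print(prompt_plain)
```

Holding the label set and examples fixed while swapping `FORMAL` for `PARAPHRASED` isolates the effect of wording from the effect of category count, which is the comparison the paragraph above describes.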

There's a deeper reason compression is double-edged. LLMs already compress concepts far more aggressively than humans do, capturing broad category structure while shedding the fine-grained distinctions humans preserve for situated meaning (Do LLMs compress concepts more aggressively than humans do?). Folding Walton's schemes into nine categories plays *to* that tendency — coarser buckets match what the model naturally retains. But it cuts the other way too: if the nine categories still require distinguishing arguments by their underlying logical form, the model may struggle, because it reasons through semantic association rather than symbolic structure. When meaning is stripped away and only the logical skeleton remains, LLM reasoning collapses even with the correct rules in hand (Do large language models reason symbolically or semantically?).

That's the catch worth knowing: argument schemes are partly *formal* objects, and LLMs are not formal reasoners. A model can even produce a flawless explanation of a scheme and then fail to apply it to an actual argument — a disconnect between knowing and doing that doesn't look like a human knowledge gap (Can LLMs understand concepts they cannot apply?). So compressing to nine categories likely helps most if those categories are semantically distinct (different topics, different vocabulary) and helps least if they're formally distinct but semantically overlapping. Fewer, well-paraphrased, semantically-separable categories is the configuration the corpus would bet on — not nine for nine's sake.
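One crude way to check whether a candidate category set is "semantically distinct" in the sense above is to measure vocabulary overlap between category descriptions. The sketch below uses Jaccard similarity over content words; this proxy, the stopword list, and the sample descriptions are assumptions made for illustration and do not come from the corpus.

```python
import re
from itertools import combinations

# Small hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "to", "is", "it", "so", "that", "and",
             "in", "on", "what", "because"}


def content_words(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(descriptions):
    """Map each pair of category names to the overlap of their descriptions."""
    vocab = {name: content_words(text) for name, text in descriptions.items()}
    return {
        (x, y): jaccard(vocab[x], vocab[y])
        for x, y in combinations(sorted(vocab), 2)
    }


# Hypothetical category descriptions.
categories = {
    "authority": "A trusted specialist says the claim holds.",
    "analogy": "Two situations are alike, so the claim holds in both.",
    "consequences": "Doing this leads to good or bad outcomes.",
}
for pair, score in sorted(pairwise_overlap(categories).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs with high overlap are candidates for the "formally distinct but semantically overlapping" case, where the argument above predicts compression helps least. Lexical overlap is only a rough stand-in for how a model actually represents the categories.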

Sources (5 notes)

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Why do paraphrased definitions work better than expert ones?

LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Probing Structured Semantics Understanding and Generation of Language Models via Question Answering · 2.49 match · arXiv
Can Large Language Models Understand Argument Schemes? · 1.75 match · arXiv
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey · 1.74 match · arXiv
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens · 1.73 match · arXiv
Large Language Model Reasoning Failures · 1.73 match · arXiv
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models · 1.71 match · arXiv
Word Meanings in Transformer Language Models · 1.67 match · arXiv
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning · 0.91 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether compressing argument schemes into fewer categories genuinely eases LLM classification, or whether the constraint is perishable. A curated library (spanning 2023–2026) found the following — treat these as DATED claims:

**What a curated library found — and when:**
- LLMs classify argument schemes acceptably only under narrow conditions (few-shot + descriptions); even large models plateau near F1 0.65, smaller ones near 0.53 (~2023–2024).
- LLM-generated paraphrases of schemes outperform Walton's formal definitions, because paraphrases align with the model's training distribution (~2025).
- LLMs compress concepts far more aggressively than humans, shedding fine-grained distinctions; they are in-context *semantic* reasoners, not symbolic reasoners (~2023–2025).
- When meaning is stripped and only logical skeleton remains, LLM reasoning collapses; "Potemkin understanding" — correct explanation + failure to apply — is a distinct failure mode (~2024–2025).
- Few, well-paraphrased, semantically-distinct categories likely help more than compression alone (~synthesis, 2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2305.14825 (2023-05): In-context semantic vs. symbolic reasoning
- arXiv:2505.17117 (2025-05): Compression vs. meaning trade-offs in LLMs and humans
- arXiv:2507.08017 (2025-07): Mechanistic indicators of understanding
- arXiv:2603.23004 (2026-03): Reasoning and optimization under constraints

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, assess whether newer models (o1-scale, frontier instruction-tuned variants), in-context retrieval augmentation (semantic indexing, dynamic few-shot retrieval), chain-of-thought scaffolding, or formal-reasoning tooling (SAT solvers, symbolic interpreters) have since *relaxed* the ceiling or overturned the semantic-vs.-symbolic split. Separate the durable question ("Are LLMs form-blind reasoners?") from the perishable limitation ("F1 0.65 is the max"). State plainly where each constraint still holds.

(2) **SURFACE CONTRADICTING WORK.** Identify the strongest recent papers (last ~6 months) that either refute the semantic-reasoning-only model or show argument scheme classification working *without* paraphrase engineering.

(3) **PROPOSE 2 RESEARCH QUESTIONS** that assume the regime may have shifted: e.g., do multi-agent critique loops overcome symbolic-reasoning gaps? Does fine-tuning on argument-scheme exemplars + formal-logic tokens unlock category-count independence?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
