Can LLM-generated descriptions of schemes outperform formal dictionary definitions for prompting?
This explores whether LLM-paraphrased descriptions of argument schemes beat the formal Walton dictionary definitions when you put them in a classification prompt — and why that would even happen.
The corpus says yes, and gives a clear reason why. When models classify arguments by scheme (appeal to authority, cause-to-effect, and so on), feeding them LLM-generated paraphrases of each scheme works better than feeding them Walton's formal logical definitions (see: Why do paraphrased definitions work better than expert ones?). The mechanism isn't that the paraphrases are clearer to humans; it's that they're closer to the model's own training distribution. Formal logical vocabulary is rare and stilted relative to the everyday language LLMs absorbed during pretraining, so a definition phrased the way the model already 'thinks' lands better than one phrased the way a logician would write it.
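To make the comparison concrete, here is a minimal sketch of a classification prompt whose scheme descriptions can be swapped between a formal set and a paraphrased set. Everything in it is an assumption for illustration: the label names, the `build_prompt` helper, and the description wording, which is neither Walton's verbatim text nor the output of any real LLM.

```python
# Illustrative scheme descriptions. The wording is hypothetical: it is not
# Walton's verbatim text and not the output of an actual LLM.
FORMAL = {
    "expert_opinion": (
        "Major premise: Source E is an expert in domain S containing proposition A. "
        "Minor premise: E asserts that A is true. "
        "Conclusion: A is true."
    ),
    "cause_to_effect": (
        "Major premise: Generally, if A occurs, then B will occur. "
        "Minor premise: In this case, A occurs. "
        "Conclusion: In this case, B will occur."
    ),
}

PARAPHRASED = {
    "expert_opinion": "Someone who knows the field says it's true, so we should believe it.",
    "cause_to_effect": "This thing happened, and it usually leads to that, so that will happen too.",
}


def build_prompt(argument: str, descriptions: dict[str, str]) -> str:
    """Build a scheme-classification prompt from one set of descriptions."""
    lines = ["Classify the argument into exactly one of these schemes.", ""]
    for label, text in descriptions.items():
        lines.append(f"- {label}: {text}")
    lines += ["", f"Argument: {argument}", "Scheme:"]
    return "\n".join(lines)


argument = "My cardiologist says this drug is safe, so it is safe."
formal_prompt = build_prompt(argument, FORMAL)
paraphrased_prompt = build_prompt(argument, PARAPHRASED)
```

The two prompts are identical except for the description text, so any difference in classification accuracy between them can be attributed to phrasing rather than to the task framing.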
That connects to a deeper claim running through the collection: prompting can only reorganize and activate knowledge the model already has, never inject what's missing (see: Can prompt optimization teach models knowledge they lack?). A paraphrase outperforming an expert definition is exactly what that claim predicts. Neither version adds new knowledge, but the paraphrase is a better key to unlock what's latent because it's written in the model's native dialect. The win is a retrieval-of-internal-knowledge effect, not a teaching effect.
There's also a reason formal definitions specifically underperform: LLMs reason through semantic association rather than symbolic logic (see: Do large language models reason symbolically or semantically?). A Walton definition leans on formal logical structure (warrants, premises, defeasible inference), which is the kind of symbolic manipulation models are weakest at. A paraphrase swaps that scaffolding for semantic cues the model can pattern-match against, playing to its strength instead of its blind spot.
But don't over-read the result. The same line of work shows the paraphrase advantage is fragile: zero-shot prompting fails uniformly, good descriptions still need few-shot examples alongside them, and even then only the larger models clear a mediocre bar (Claude tops out around F1 0.65; smaller models plateau near 0.53) (see: Can large language models classify argument schemes reliably?). So 'paraphrase beats formal definition' is a real effect operating under a hard ceiling: better phrasing helps, but it doesn't manufacture capability the model lacks.
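For readers who want to see what an F1 figure like those above measures, here is a minimal sketch of scoring predicted scheme labels against gold labels. It assumes macro-averaged F1 (the source notes do not say which averaging was used), and the labels are toy data invented for illustration, not results from the cited work.

```python
from collections import Counter


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Macro-averaged F1: the unweighted mean of per-label F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it did not belong
            fn[g] += 1  # missed the true label g

    scores = []
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
gold = ["expert", "cause", "expert", "cause"]
pred = ["expert", "expert", "expert", "cause"]
score = macro_f1(gold, pred)
```

On the toy data, "expert" scores F1 0.8 (precision 2/3, recall 1) and "cause" scores 2/3 (precision 1, recall 1/2), so the macro average is about 0.73. Running the same scorer over predictions from two prompt variants is one way to compare a formal-definition condition against a paraphrased one.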
The genuinely interesting takeaway: the best prompt isn't the most rigorous one. It's the one written in the language the model already speaks. That cuts against the intuition that more precise, more formal instructions should help. It also seems to conflict with a complementary thread in the corpus, where imposing explicit structured reasoning steps (like running an argument through critical questions) does improve rigor (see: Can structured argument prompts make LLM reasoning more rigorous?). The reconciliation: structure helps when it guides the reasoning process, but formality hurts when it's just vocabulary the model never fluently absorbed.
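One way to picture that reconciliation is a prompt that keeps the structure of the reasoning process but phrases each step in everyday language. The sketch below is an assumption for illustration only: the step wording and the `build_structured_prompt` helper are hypothetical, loosely modeled on Toulmin's components (claim, data, warrant, backing), and this is not the prompt used in the cited CQoT work.

```python
# Hypothetical step list: Toulmin's components phrased in everyday language.
# This is not the prompt used in the cited CQoT work.
STEPS = [
    ("claim", "What is the argument trying to get you to believe?"),
    ("data", "What facts or evidence does it offer for that?"),
    ("warrant", "What unstated rule connects the evidence to the claim?"),
    ("backing", "What supports that rule, and is it actually given?"),
]


def build_structured_prompt(argument: str) -> str:
    """Build a prompt that walks the model through ordered reasoning steps."""
    lines = [
        f"Argument: {argument}",
        "",
        "Answer each step in order before giving a verdict.",
    ]
    for number, (name, question) in enumerate(STEPS, start=1):
        lines.append(f"{number}. ({name}) {question}")
    lines.append("Verdict: is the argument well supported? Explain briefly.")
    return "\n".join(lines)


structured_prompt = build_structured_prompt(
    "My cardiologist says this drug is safe, so it is safe."
)
```

The ordering carries the rigor here: the model must name a warrant before it can judge the backing. The technical terms appear only as short tags, while the questions themselves avoid formal vocabulary.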
Sources (5 notes)
LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.