INQUIRING LINE

Can LLM-generated descriptions of schemes outperform formal dictionary definitions for prompting?

This explores whether LLM-paraphrased descriptions of argument schemes beat the formal Walton dictionary definitions when you put them in a classification prompt — and why that would even happen.


The corpus says yes, and gives a clear reason why: casual, LLM-written descriptions of argument schemes outperform the precise definitions that experts wrote. When models classify arguments by scheme (appeal to authority, cause-to-effect, and so on), feeding them LLM-generated paraphrases of each scheme works better than feeding them Walton's formal logical definitions (see: Why do paraphrased definitions work better than expert ones?). The mechanism isn't that the paraphrases are clearer to humans; it's that they're closer to the model's own training distribution. Formal logical vocabulary is rare and stilted relative to the everyday language LLMs absorbed during pretraining, so a definition phrased the way the model already 'thinks' lands better than one phrased the way a logician would write it.
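To make the comparison concrete, here is a minimal sketch of how the two prompt variants could be built so that only the scheme descriptions change. Everything in it is an assumption for illustration: the `build_prompt` helper, the prompt wording, and both sets of descriptions are hypothetical placeholders, not Walton's text and not the prompts used in the underlying study.

```python
from typing import Dict

# Illustrative placeholders only. The "formal" entries loosely imitate the
# premise/conclusion style of a scheme dictionary; they are NOT Walton's wording.
FORMAL: Dict[str, str] = {
    "appeal to authority": (
        "Premise: E is an expert in domain D. Premise: E asserts A, and A is "
        "within D. Conclusion: A may plausibly be taken to be true."
    ),
    "cause to effect": (
        "Premise: generally, if A occurs then B occurs. Premise: A occurs. "
        "Conclusion: B will occur."
    ),
}

# Illustrative placeholders for the kind of everyday paraphrase an LLM might write.
PARAPHRASED: Dict[str, str] = {
    "appeal to authority": (
        "The speaker says something is true because an expert or trusted "
        "source said so."
    ),
    "cause to effect": (
        "The speaker predicts an outcome because something that usually "
        "leads to it has happened."
    ),
}


def build_prompt(argument: str, descriptions: Dict[str, str]) -> str:
    """Assemble a classification prompt from a set of scheme descriptions."""
    lines = ["Classify the argument into exactly one scheme.", "", "Schemes:"]
    for name, description in descriptions.items():
        lines.append(f"- {name}: {description}")
    lines += ["", f"Argument: {argument}", "Scheme:"]
    return "\n".join(lines)
```

Calling `build_prompt(argument, FORMAL)` and `build_prompt(argument, PARAPHRASED)` on the same argument yields two prompts that differ only in the description text, which is the single variable the comparison is meant to isolate.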

That connects to a deeper claim running through the collection: prompting can only reorganize and activate knowledge the model already has, never inject what's missing (see: Can prompt optimization teach models knowledge they lack?). A paraphrase outperforming an expert definition is exactly what that claim predicts. Neither version adds new knowledge, but the paraphrase is a better key to unlock what's latent because it's written in the model's native dialect. The win is a retrieval-of-internal-knowledge effect, not a teaching effect.

There's also a reason formal definitions specifically underperform: LLMs reason through semantic association rather than symbolic logic (see: Do large language models reason symbolically or semantically?). A Walton definition leans on formal logical structure (warrants, premises, defeasible inference), which is the kind of symbolic manipulation models are weakest at. A paraphrase swaps that scaffolding for semantic cues the model can pattern-match against, playing to its strength instead of its blind spot.

But don't over-read the result. The same line of work shows the paraphrase advantage is fragile: zero-shot prompting fails uniformly across models, good descriptions only help when paired with few-shot examples, and even then only the larger models clear a mediocre bar, with Claude topping out around F1 0.65 and smaller models plateauing near 0.53 (see: Can large language models classify argument schemes reliably?). So 'paraphrase beats formal definition' is a real effect operating under a hard ceiling: better phrasing helps, but it doesn't manufacture capability the model lacks.
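For readers who want to check numbers like these on their own data, the following is a minimal sketch of an F1 computation for multi-class scheme labels. It assumes macro-averaged F1 (the unweighted mean of per-scheme F1); the source reports F1 without saying which averaging was used, so that choice and the `macro_f1` helper name are assumptions.

```python
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)  # true positives
        fp = sum(g != label and p == label for g, p in pairs)  # false positives
        fn = sum(g == label and p != label for g, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never matches.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring the same test set once with predictions from the formal-definition prompt and once with predictions from the paraphrased prompt gives two comparable numbers. Macro-averaging treats rare schemes as equal to common ones, which matters when scheme frequencies are skewed.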

The genuinely interesting takeaway: the best prompt isn't the most rigorous one. It's the one written in the language the model already speaks. That cuts against the intuition that more precise, more formal instructions should help. It also appears to run opposite to a complementary thread in the corpus, where imposing explicit structured reasoning steps (like running an argument through critical questions) does improve rigor (see: Can structured argument prompts make LLM reasoning more rigorous?). The reconciliation: structure helps when it guides the reasoning process, but formality hurts when it's just vocabulary the model never fluently absorbed.


Sources (5 notes)

Why do paraphrased definitions work better than expert ones?

LLM-generated descriptions of argument schemes yield better classification performance than expert Walton definitions. The advantage stems from paraphrases matching the model's training distribution better than formal logical vocabulary.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can large language models classify argument schemes reliably?

Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM prompting strategy. The question: does casual LLM-generated phrasing of concepts outperform formal expert definitions for in-context task performance?

What a curated library found, and when (findings span 2023–2026; treat them as dated snapshots, not current truth):
- LLM-generated paraphrases of argument schemes outperform Walton's formal logical definitions; best performer (Claude) achieved ~F1 0.65, smaller models ~0.53 (2024–25).
- The mechanism: paraphrases activate latent knowledge via training-distribution alignment, not by teaching new concepts; formal vocabulary is rare in pretraining, so it's a retrieval problem, not a comprehension one (2023–2024).
- LLMs reason via semantic association, not symbolic logic; formal logical structure (warrants, premises) activates their weakest mode; semantic paraphrases play to pattern-matching strength (2023).
- Zero-shot fails uniformly; even with good descriptions, few-shot examples and larger model scale are necessary; the paraphrase advantage is fragile (2024–25).
- Structured reasoning steps (critical-questions prompting) improve rigor and reasoning quality; structure as process guidance helps; formality as vocabulary hurts (2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.14825 (May 2023): In-context semantic reasoning, not symbolic.
- arXiv:2410.12405 (Oct 2024): Prompt sensitivity of LLMs across scales.
- arXiv:2412.15177 (Dec 2024): Critical-questions-of-thought steering.
- arXiv:2502.10708 (Feb 2025): Domain-specific knowledge injection survey.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above: does newer model scale (o1, gpt-4-turbo, Claude 3.5+), chain-of-thought training, or structured in-context learning (tree-of-thought, RAG over domain ontologies, multi-turn caching) relax the formality-vs.-paraphrase tradeoff? Has the F1 0.65 ceiling budged for argument-scheme classification? Does formal reasoning fine-tuning or reasoning-specific pretraining invert the paraphrase advantage? Cite what moved it, and state plainly where the constraint still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Does prompt-sensitivity research (arXiv:2410.12405) or knowledge-injection surveys (arXiv:2502.10708) undermine the 'native dialect' claim? 
(3) Propose 2 research questions that assume the regime may have shifted: (a) Does explicit instruction fine-tuning on formal logical reasoning reverse semantic-association dominance? (b) Can retrieval-augmented prompting (RAG + formal schema) bypass the training-distribution bottleneck?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
