Why do different people reconstruct the same argument differently?
When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?
Argunauts (Argument Annotation Units) is a dataset and benchmark for argument reconstruction — extracting explicit logical structures from natural language arguments. The dataset's most significant finding is methodological: when multiple annotators (human and LLM) reconstruct the same argument independently, they produce different but equally valid reconstructions.
This is not annotation disagreement in the sense of noise to be resolved. Multiple reconstruction schemas — different choices about what counts as a premise, how to formalize the conclusion, what implicit assumptions to make explicit — are each internally valid. There is no gold standard because the text underdetermines the reconstruction.
This connects directly to "Why do readers interpret the same sentence so differently?", but at a structural rather than semantic level. Interpretive multiplicity in NLI (natural language inference) is about meaning: what a sentence means depends on the reader's social position. Reconstruction multiplicity in argumentation is about structure: how an argument should be formalized depends on which reconstruction schema is applied.
Both findings converge on a challenge to the NLP assumption that language processing tasks have unique correct outputs. "Do standard NLP benchmarks hide LLM ambiguity failures?" describes how benchmarks respond to this problem by exclusion, filtering ambiguous examples out before testing. For argumentation, exclusion is not possible: underdetermination is a feature of the task itself, not of edge cases.
The practical implication: evaluating LLMs on argument reconstruction requires acknowledging that precision and recall metrics assume ground truth that does not exist. Models that disagree with a reference annotation may be producing equally valid reconstructions. The field is measuring agreement with one valid interpretation and calling it correctness.
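The point about single-reference metrics can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than part of the Argunauts benchmark: the toy argument, the `Reconstruction` type, the `precision_recall` helper, and the use of exact string matching between premises (a real evaluation would need some form of semantic matching).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Reconstruction:
    """Hypothetical premise-conclusion reconstruction with normalized premise strings."""
    premises: frozenset[str]
    conclusion: str


def precision_recall(predicted: Reconstruction, reference: Reconstruction) -> tuple[float, float]:
    """Premise-level precision and recall of `predicted` against ONE reference.

    Assumption: premises match only if their strings are identical.
    """
    overlap = len(predicted.premises & reference.premises)
    precision = overlap / len(predicted.premises) if predicted.premises else 0.0
    recall = overlap / len(reference.premises) if reference.premises else 0.0
    return precision, recall


# Toy source text (invented for illustration): "Socrates is a man, so he is mortal."
# Both references are deductively valid; they differ only in which implicit
# premise they make explicit.
ref_universal = Reconstruction(
    premises=frozenset({"socrates is a man", "all men are mortal"}),
    conclusion="socrates is mortal",
)
ref_conditional = Reconstruction(
    premises=frozenset({"socrates is a man", "if socrates is a man, socrates is mortal"}),
    conclusion="socrates is mortal",
)

# A model output that happens to follow the conditional schema.
model_output = Reconstruction(
    premises=frozenset({"socrates is a man", "if socrates is a man, socrates is mortal"}),
    conclusion="socrates is mortal",
)

scores = {
    "vs universal": precision_recall(model_output, ref_universal),
    "vs conditional": precision_recall(model_output, ref_conditional),
}
# Scoring against the best-matching valid reference is one possible mitigation.
best = max(scores.values())

for name, (p, r) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The same output scores 0.50/0.50 against one valid reference and 1.00/1.00 against the other, so a single-reference score reflects which schema the annotator chose as much as the quality of the reconstruction. Taking the maximum over several references softens the problem but does not remove it, since any finite reference set may still omit a valid reconstruction.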
This also supports "Why do speakers deliberately use ambiguous language?" from a new angle: structural ambiguity (multiple valid formalizations of the same argument) is as fundamental as semantic ambiguity.
Inquiring lines that use this note as a source (10)
This note is a source for these synthesized inquiries.
- Why do stakeholders interpret the same explanation differently in practice?
- Why does describing a process differ fundamentally from arguing about evidence?
- Why does the same recalled information lead to different reasoning conclusions?
- Why does regenerating LLM responses produce different but equally valid answers?
- Why does who makes an argument matter as much as what the argument says?
- What are the three orthogonal axes that structure the argument scheme periodic table?
- How do first-order and second-order arguments differ in formal structure?
- What are the nine possible proposition-type combinations in arguments?
- Can argumentation structure improve reasoning through decomposition alone?
- What makes an argument fallacious according to formal linguistic criteria?
Related concepts in this collection (3)
- Why do readers interpret the same sentence so differently?
  How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.
  That note concerns semantic multiplicity; this one concerns structural multiplicity; same root problem.
- Why do speakers deliberately use ambiguous language?
  Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.
  The broader principle this note exemplifies at the argument-structure level.
- Do standard NLP benchmarks hide LLM ambiguity failures?
  When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?
  Benchmark exclusion as the standard NLP response to underdetermination.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Argunauts: Open LLMs that Master Argument Analysis with Argdown
- The Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants
- Argument Quality Assessment in the Age of Instruction-Following Large Language Models
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
- Computational Modelling of Undercuts in Real-world Arguments
- The Thin Line Between Comprehension and Persuasion in LLMs
- Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments
- AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts
Original note title
argument reconstruction is fundamentally underdetermined because multiple valid reconstructions exist for the same text with no ground truth