SYNTHESIS NOTE

Why do different people reconstruct the same argument differently?

When humans and LLMs extract logical structure from arguments, they produce different reconstructions. Is this disagreement a problem to solve, or does it reveal something fundamental about how arguments work?

Synthesis note · 2026-02-21 · sourced from Argumentation

Argunauts (Argument Annotation Units) is a dataset and benchmark for argument reconstruction — extracting explicit logical structures from natural language arguments. The dataset's most significant finding is methodological: when multiple annotators (human and LLM) reconstruct the same argument independently, they produce different but equally valid reconstructions.

This is not annotation disagreement in the sense of noise to be resolved. Multiple reconstruction schemas — different choices about what counts as a premise, how to formalize the conclusion, what implicit assumptions to make explicit — are each internally valid. There is no gold standard because the text underdetermines the reconstruction.
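
A minimal sketch of what "internally valid" can mean here, using a hypothetical one-line argument ("We should cut emissions, because emissions cause warming") that is not drawn from the dataset. The propositional encoding, the atoms e, h, and c, and the helper is_valid are all assumptions made for illustration; real reconstructions are richer than propositional logic. Two reconstructions supply different implicit premises, and a brute-force truth-table check accepts both.

```python
from itertools import product


def is_valid(premises, conclusion, variables):
    """Return True if no truth assignment makes every premise true and the conclusion false."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample
    return True


# Hypothetical atoms (illustrative only):
#   e: emissions cause warming
#   h: warming is harmful
#   c: we should cut emissions
VARIABLES = ["e", "h", "c"]
conclusion = lambda v: v["c"]

# The text as stated: one premise, no bridging assumption.
as_stated = [lambda v: v["e"]]

# Reconstruction A adds the implicit premise "e implies c".
reconstruction_a = [
    lambda v: v["e"],
    lambda v: (not v["e"]) or v["c"],
]

# Reconstruction B adds "h" and "(e and h) implies c" instead.
reconstruction_b = [
    lambda v: v["e"],
    lambda v: v["h"],
    lambda v: (not (v["e"] and v["h"])) or v["c"],
]

stated_ok = is_valid(as_stated, conclusion, VARIABLES)
a_ok = is_valid(reconstruction_a, conclusion, VARIABLES)
b_ok = is_valid(reconstruction_b, conclusion, VARIABLES)
print(stated_ok, a_ok, b_ok)
```

The stated text alone is not deductively valid, while both reconstructions are. They commit the arguer to different implicit premises, and nothing in the text selects between them.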

This connects directly to "Why do readers interpret the same sentence so differently?" but at a structural rather than semantic level. Interpretive multiplicity in NLI is about meaning: what a sentence means depends on the reader's social position. Reconstruction multiplicity in argumentation is about structure: how an argument should be formalized depends on which reconstruction schema is applied.

Both findings converge on a challenge to the NLP assumption that language processing tasks have unique correct outputs. "Do standard NLP benchmarks hide LLM ambiguity failures?" describes how benchmarks respond to this problem by exclusion. For argumentation, exclusion is not possible: underdetermination is not a feature of edge cases but of the task itself.

The practical implication: evaluating LLMs on argument reconstruction requires acknowledging that precision and recall metrics assume ground truth that does not exist. Models that disagree with a reference annotation may be producing equally valid reconstructions. The field is measuring agreement with one valid interpretation and calling it correctness.
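
One way to make this concrete is to score a model against a set of valid references rather than a single one. The sketch below is an illustration under stated assumptions, not the benchmark's evaluation protocol: the Reconstruction type, the exact-string premise matching, the best-of-references rule, and the example sentences are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Reconstruction:
    """Hypothetical container: a set of premise strings plus a conclusion."""
    premises: FrozenSet[str]
    conclusion: str


def f1(predicted: Reconstruction, reference: Reconstruction) -> float:
    """Premise-level F1. Exact string match is a simplifying assumption."""
    overlap = len(predicted.premises & reference.premises)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted.premises)
    recall = overlap / len(reference.premises)
    return 2 * precision * recall / (precision + recall)


def best_f1(predicted: Reconstruction, references: Iterable[Reconstruction]) -> float:
    """Score against every accepted reference and keep the best match."""
    return max((f1(predicted, ref) for ref in references), default=0.0)


# Two hypothetical, equally acceptable references for the same text.
reference_a = Reconstruction(
    premises=frozenset({
        "Emissions cause warming.",
        "If emissions cause warming, we should cut emissions.",
    }),
    conclusion="We should cut emissions.",
)
reference_b = Reconstruction(
    premises=frozenset({
        "Emissions cause warming.",
        "Warming is harmful.",
        "If emissions cause harmful warming, we should cut emissions.",
    }),
    conclusion="We should cut emissions.",
)

# A model that happens to produce the second reconstruction.
model_output = reference_b

single_reference_score = f1(model_output, reference_a)  # penalized, roughly 0.4
multi_reference_score = best_f1(model_output, [reference_a, reference_b])  # full credit
print(single_reference_score, multi_reference_score)
```

Against the single reference the model looks wrong on most premises; against the set of accepted reconstructions it receives full credit. Best-of-references is only one possible aggregation, and it still depends on someone enumerating the valid reconstructions.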

This also grounds "Why do speakers deliberately use ambiguous language?" from a new angle: structural ambiguity (multiple valid formalizations of the same argument) is as fundamental as semantic ambiguity.

Inquiring lines that read this note 11

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can AI-generated outputs constitute genuine knowledge or valid claims?

Why do stakeholders interpret the same explanation differently in practice?

What factors beyond surface content determine how readers extract meaning differently?

Why does describing a process differ fundamentally from arguing about evidence?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why does the same recalled information lead to different reasoning conclusions?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

What makes AI persuasion effective and how can we counter it?

Why does who makes an argument matter as much as what the argument says?

How does reasoning graph topology affect breakthrough insights and generalization?

Why should disagreement be treated as signal in collaborative reasoning?

What makes an argument fallacious according to formal linguistic criteria?

When does optimizing for quality undermine the value of diversity?

Why does argument diversity matter more than individual argument quality?

Related concepts in this collection 3

Why do readers interpret the same sentence so differently?
How much of annotation disagreement in NLP reflects genuine interpretive multiplicity rather than error? This explores whether social position and moral framing systematically generate competing but equally valid readings.
That note covers semantic multiplicity; this one covers structural multiplicity. Both share the same root problem.

Why do speakers deliberately use ambiguous language?
Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.
The broader principle that this note exemplifies at the argument-structure level.

Do standard NLP benchmarks hide LLM ambiguity failures?
When benchmark creators filter out ambiguous examples before testing, do they accidentally make it impossible to measure whether language models can actually handle ambiguity the way humans do?
Benchmark exclusion as the standard NLP response to underdetermination.

Original note title

argument reconstruction is fundamentally underdetermined because multiple valid reconstructions exist for the same text with no ground truth
