A Robustness Evaluation Framework for Argument Mining

Standard practice for evaluating the performance of machine learning models for argument mining is to report different metrics such as accuracy or F1. However, little is usually known about the model’s stability and consistency when deployed in real-world settings. In this paper, we propose a robustness evaluation framework to guide the design of rigorous argument mining models. As part of the framework, we introduce several novel robustness tests tailored specifically to argument mining tasks. Additionally, we integrate existing robustness tests designed for other natural language processing tasks and re-purpose them for argument mining. Finally, we illustrate the utility of our framework on two widely used argument mining corpora, UKP topic-sentences and IBM Debater Evidence Sentence. We argue that our framework should be used in conjunction with standard performance evaluation techniques as a measure of model stability.

Introduction. Deep learning models have obtained state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks and have even achieved superhuman performance on benchmark tasks (Wang et al., 2019). The standard approach for evaluating machine learning models is to use held-out data and report various performance metrics such as accuracy and F1. However, reporting an aggregate statistic on benchmarks does not reflect the model’s performance and robustness when applied to real-world texts. Indeed, recent works have shown that NLP models are not robust to perturbations. For instance, natural language inference (NLI) models classify a permuted example, in which word positions are randomly changed, the same way they would classify the original input (Sinha et al., 2021), and sentiment analysis models give a lower sentiment score when a positive phrase is added to the original example (Ribeiro et al., 2020). Koch et al. (2021) argue for rigorous evaluation to avoid poor generalisability, whereas Raji et al. (2021) propose systematic development of test suites.

Discussion / Conclusion. We proposed a robustness evaluation framework for machine learning-based argument mining models. Our framework is model-agnostic and only requires access to the data. We presented 15 simulation functions, of which 6 are novel and tailored for the argument classification task by exploiting sentence-level topic information within an argument or motion; the remaining functions are re-purposed for argument mining tasks. These functions can be used to automatically create simulated datasets that mimic realistic settings and test the model’s robustness. We illustrated the utility of our framework on two widely used argument mining corpora, UKP topic-sentences and IBM Debater Evidence Sentence, and showed that, while robust, BERT models can still be vulnerable to new inputs. Our robustness evaluation framework can be used to enhance the standard performance evaluation in order to create better models for argument mining by measuring model stability. We experimented with the major corpora available for argument mining; however, our framework can also be applied to datasets for relation prediction in argument mining (Cocarascu et al., 2020). There are several avenues for future work.
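The idea of model-agnostic simulation functions can be illustrated with a minimal sketch. It is not the paper's implementation: the two perturbations (`shuffle_words`, `append_topic`), the `(topic, sentence, label)` data layout, and the toy keyword classifier are all assumptions made for illustration, and the paper's 15 functions are not reproduced here. The sketch only shows the general pattern: apply a function to every example to build a simulated dataset, then compare the model's accuracy on it with the accuracy on the original data.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical data layout: (topic, sentence, label).
Example = Tuple[str, str, int]
# A simulation function maps (topic, sentence, rng) to a perturbed sentence.
SimulationFn = Callable[[str, str, random.Random], str]
# The model is treated as a black box: (topic, sentence) -> predicted label.
PredictFn = Callable[[str, str], int]


def shuffle_words(topic: str, sentence: str, rng: random.Random) -> str:
    """Illustrative perturbation: randomly permute word positions."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def append_topic(topic: str, sentence: str, rng: random.Random) -> str:
    """Illustrative topic-aware perturbation: append the topic to the sentence."""
    return f"{sentence} {topic}"


def simulate(dataset: List[Example], fn: SimulationFn, seed: int = 0) -> List[Example]:
    """Build a simulated dataset by perturbing every sentence; labels are kept."""
    rng = random.Random(seed)
    return [(topic, fn(topic, sentence, rng), label) for topic, sentence, label in dataset]


def accuracy(predict: PredictFn, dataset: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(predict(topic, sentence) == label for topic, sentence, label in dataset)
    return correct / len(dataset)


def robustness_report(
    predict: PredictFn, dataset: List[Example], functions: Dict[str, SimulationFn]
) -> Dict[str, float]:
    """Accuracy on the original data and on each simulated dataset."""
    report = {"original": accuracy(predict, dataset)}
    for name, fn in functions.items():
        report[name] = accuracy(predict, simulate(dataset, fn))
    return report


def toy_predict(topic: str, sentence: str) -> int:
    """Toy stand-in for a real classifier: 'argument' if the sentence contains 'because'."""
    return int("because" in sentence.lower().split())


toy_data: List[Example] = [
    ("nuclear energy", "Nuclear energy is safe because accidents are rare", 1),
    ("nuclear energy", "Nuclear energy is a topic", 0),
]

report = robustness_report(
    toy_predict,
    toy_data,
    {"shuffle_words": shuffle_words, "append_topic": append_topic},
)
print(report)
```

Because the harness only calls `predict` on text, any classifier (for example a fine-tuned BERT model wrapped in a function) could be substituted for `toy_predict`. A large gap between the `original` entry and a perturbed entry would indicate sensitivity to that perturbation. The bag-of-words toy model here is, by construction, unaffected by word order.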

Lines of inquiry this paper opens

Research framings built by reading the notes related to this paper — the questions it feeds into.

Do language models learn genuine linguistic structure or just surface patterns?

How does reasoning graph topology affect breakthrough insights and generalization?

Do language models understand semantics or rely on pattern matching?

What is the difference between learning discourse patterns and learning abstract language?

How should retrieval systems optimize for multi-step reasoning during inference?

What makes intent taxonomies unmanageable at hundreds of intents?

Why do multi-turn conversations degrade AI intent and coherence?

Why do discourse failures cluster in attention and intentional layers rather than linguistics?

Why do language models struggle with implicit discourse relations?

When should retrieval-augmented systems decide to fetch new information?

Why does standard RAG succeed for evidence-based but fail for debate questions?

What makes specific clarifying questions more effective than generic ones?

How should dialogue systems best leverage conversation history for retrieval?

How do adversarial and manipulative prompts attack reasoning models?

How do the six trap categories map onto detection difficulty?

Why do reasoning models fail at systematic problem-solving and search?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Why do LLM descriptions of argument schemes work better than formal definitions for classification?
