
Retrieval of the Best Counterargument without Prior Topic Knowledge

Henning Wachsmuth
Paderborn University, Computational Social Science Group
[email protected]

Shahbaz Syed and Benno Stein
Bauhaus-Universität Weimar, Faculty of Media, Webis Group
<first>.<last>@uni-weimar.de

Published in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 241–251, Melbourne, Australia, July 15–20, 2018. © 2018 Association for Computational Linguistics.

Abstract

Given any argument on any controversial topic, how to counter it? This question implies the challenging retrieval task of finding the best counterargument. Since prior knowledge of a topic cannot be expected in general, we hypothesize the best counterargument to invoke the same aspects as the argument while having the opposite stance. To operationalize our hypothesis, we simultaneously model the similarity and dissimilarity of pairs of arguments, based on the words and embeddings of the arguments' premises and conclusions. A salient property of our model is its independence from the topic at hand, i.e., it applies to arbitrary arguments. We evaluate different model variations on millions of argument pairs derived from the web portal idebate.org. Systematic ranking experiments suggest that our hypothesis is true for many arguments: For 7.6 candidates with opposing stance on average, we rank the best counterargument highest with 60% accuracy. Even among all 2801 test set pairs as candidates, we still find the best one about every third time.

1 Introduction

Many controversial topics in real life divide us into opposing camps, such as whether to ban guns, who should become president, or what phone to buy. When being confronted with arguments against our stance, but also when forming our own arguments, we need to think about how they could best be countered. Argumentation theory tells us that — aside from ad-hominem attacks — a counterargument denies either an argument's premises, its conclusion, or the reasoning between them (Walton, 2009). Take the following argument in favor of the right to bear arms from the web portal idebate.org:

Argument "Gun ownership is an integral aspect of the right to self defence. (conclusion) Law-abiding citizens deserve the right to protect their families in their own homes, especially if the police are judged incapable of dealing with the threat of attack. [...]" (premise)

While the conclusion seems well-reasoned, the web portal directly provides a counter to the argument:

Counterargument "Burglary should not be punished by vigilante killings of the offender. No amount of property is worth a human life. Perversely, the danger of attack by homeowners may make it more likely that criminals will carry their own weapons. If a right to self-defence is granted in this way, many accidental deaths are bound to result. [...]"

As in this example, we observe that a counterargument often takes on the aspects of the topic invoked by the argument, while adding a new perspective to its conclusion and/or premises, conveying the opposite stance. Research has tackled the stance of argument units (Bar-Haim et al., 2017) as well as the attack relations between arguments (Cabrio and Villata, 2012). However, existing approaches learn the interplay of aspects and topics on training data or infer it from external knowledge bases (details in Section 2). This does not work for topics unseen before. Moreover, to our knowledge, no work so far aims at actual counterarguments.

This paper studies the task of automatically finding the best counterargument to any argument. In the general case, we cannot expect prior knowledge of an argument's topic. Following the observation above, we thus just hypothesize the best counterargument to invoke the same aspects as the argument while having the opposite stance. Figure 1 sketches how we operationalize the hypothesis. In particular, we simultaneously model the topic similarity and stance dissimilarity of a candidate counterargument
to the argument. Both are inferred — in different ways — from the similarities to the argument's conclusion and premises, since it is unclear in advance whether either of these units or the reasoning between them is countered. Thereby, we find the most dissimilar among the most similar arguments.

[Figure 1: Modeling the simultaneous similarity and dissimilarity of a counterargument to an argument. The diagram relates a candidate counterargument to the argument's conclusion and premises: the two unit similarities are aggregated (e.g., as a sum) into a similarity to the topic and (e.g., as a maximum) into a similarity to the stance, which are assessed simultaneously.]

To study counterarguments, we provide a new corpus with 6753 argument-counterargument pairs, taken from 1069 debates on idebate.org, as well as millions of false pairs derived from them. Given the corpus, we define eight retrieval tasks that differ in the types of candidate counterarguments. Based on the words and embeddings of the arguments, we develop similarity functions that realize the outlined model as a ranking approach. In systematic experiments, we evaluate the different building blocks of our model on all defined tasks.

The results suggest that our hypothesis is true for many arguments. The best model configuration improves common word and embedding similarity measures by eight to ten points accuracy in all tasks. Inter alia, we rank 60.3% of the best counterarguments highest when given all arguments with opposite stance (7.6 on average). Even with all 2801 test arguments as candidates, we still achieve 32.4% (and a mean rank of 15), fitting the intuition that off-topic arguments are easier to discard. Our analysis reveals notable gaps across topical themes, though.

Contributions We believe that our findings will be important for applications such as automatic debating technologies (Rinott et al., 2015) and argument search (Wachsmuth et al., 2017b). To summarize, our main contributions are:

• A large corpus for studying multiple counterargument retrieval tasks (Sections 3 and 4).
• A topic-independent approach to find the best counterargument to any argument (Section 5).
• Evidence that many counterarguments can be found without topic knowledge (Section 6).

The corpus as well as the Java source code for reproducing the experiments are available at http://www.arguana.com.

2 Related Work

Counterarguments rebut arguments. In the theoretical model of Toulmin (1958), a rebuttal in fact does not attack the argument, but merely shows exceptions to the argument's reasoning. Govier (2010) suggests to rather speak of counterconsiderations in such cases. Unlike Damer (2009), who investigates how to attack several kinds of fallacies, we are interested in how to identify attacks. We focus on those that target arguments, excluding personal (ad-hominem) attacks (Habernal et al., 2018).

Following Walton (2006), an argument can be attacked in two ways: one is to question its validity — not meaning that its conclusion must be wrong. The other is to rebut it with a counterargument that entails the opposite conclusion, often by revisiting aspects or introducing new ones. This is the type of attack we study. As Walton (2009) details, rebuttals may target an argument's premises or conclusion, or they may undercut the reasoning between them.

Recently, the computational analysis of natural language argumentation is receiving much attention. Most research focuses on argument mining, ranging from segmenting a text into argument units (Ajjour et al., 2017), over identifying unit types (Rinott et al., 2015) and roles (Niculae et al., 2017), to classifying argument schemes (Feng and Hirst, 2011) and relations (Lawrence and Reed, 2017). Some works detect counterconsiderations in a text (Peldszus and Stede, 2015) or their absence (Stab and Gurevych, 2016). Such considerations make arguments more balanced (see above). In contrast, we seek arguments that defeat others.

Many approaches mine attack relations between arguments. Some use deep learning to find attacks in discussions (Cocarascu and Toni, 2017). Closer to this paper, others determine them in a given set of arguments, using textual entailment (Cabrio and Villata, 2012) or a combination of Markov logic and stance classification (Hou and Jochim, 2017). In principle, any attacking argument denotes a counterargument. Unlike previous work, however, we aim for the best counterargument to an argument.

Classifying the stance of a text towards a topic (pro or con) generally defines an alternative way of addressing counterarguments. Sobhani et al. (2015) specifically classify health-related arguments using

supervised learning, while we do not expect to have prior topic knowledge. Bar-Haim et al. (2017) approach the stance of claims towards open-domain topics. Their approach combines aspect-based sentiment with external relations between aspects and topics from Wikipedia. As such, it is in fact limited to the topics covered there. Our model applies to arbitrary arguments and counterarguments.

We need to identify only whether arguments oppose each other, not their actual stance. Similarly, Menini et al. (2017) classify only the disagreement of political texts. Part of their approach is to detect topical key aspects in an unsupervised manner, which seems useful for our purposes. Analogously, Beigman Klebanov et al. (2010) study differences in vocabulary choice for the related task of perspective classification, and Tan et al. (2016) find that the best way to persuade opinion holders in the Change my view forum on reddit.com is to use dissimilar words. As we report later, however, our experiments did not show such results for the argument-counterargument pairs we deal with.

The goal of persuasion reveals the association of counterarguments to argumentation quality. Many quality criteria have been assessed for arguments, surveyed in (Wachsmuth et al., 2017a). In the study of Habernal and Gurevych (2016), one reason annotators gave for why an argument was more convincing than another was that it tackled flaws in the opposing view. Zhang et al. (2016) even found that debate winners tend to counter opposing arguments rather than focusing on their own arguments.

Argument quality assessment is particularly important in retrieval scenarios. Existing approaches aim to retrieve documents that contain many claims (Roitman et al., 2016) or that provide most support for their claims (Braunstain et al., 2016). In Wachsmuth et al. (2017c), we adapt PageRank to argumentative relations, in order to assess argument relevance objectively. While our search engine args for arguments on the web still uses content-based relevance measures in its first version (Wachsmuth et al., 2017b), its long-term idea is to rank the best arguments highest.[1] The model presented in this work finds the best counterarguments, but it is meant to be integrated into args at some point.

[1] Argument search engine args: http://args.me

Like here, args uses idebate.org arguments. Others take data from that portal for studying support (Boltužić and Šnajder, 2014) or for the distant supervision of argument mining (Al-Khatib et al., 2016). Our corpus is not only larger, though, but it is the first to utilize a unique feature of idebate.org: the explicit specification of counterarguments.

3 The ArguAna Counterargs Corpus

This section introduces our ArguAna Counterargs corpus with argument-counterargument pairs, created automatically from the structure of idebate.org. The corpus is freely available at http://www.arguana.com/data. We also provide the code to replicate the construction process.

3.1 The Web Portal idebate.org

On the portal idebate.org, diverse controversial topics of usually rather general interest are discussed in debates, subsumed under 15 themes, such as "economy" and "health". Each debate has a title capturing a thesis on a topic, such as "This House would limit the right to bear arms", followed by an introductory text, a set of mostly elaborated and well-written points that have a pro or a con stance towards the thesis, and a bibliography.

A specific feature of idebate.org is that virtually every point comes along with a counter that immediately attacks the point and its stance. Both points and counters can be seen as arguments. While a point consists of a one-sentence claim (the argument's conclusion) and a few sentences justifying the claim (the premise(s)), the counter's (opposite) conclusion remains implicit.

All arguments on the portal are established by a community with the goal of showing both sides of a topic in a balanced manner. We therefore assume each counter to be the best counterargument available for the respective point, and we use all resulting true argument pairs as the basis of our corpus. Figure 2 illustrates the italicized concepts, showing the structure of idebate.org. An example argument pair has been discussed in Section 1.

3.2 Corpus Construction

We crawled all debates from idebate.org that follow the portal's theme-guided folder structure (last access: January 30, 2018). From each debate, we extracted the thesis, the introductory text, all points and counters, the bibliography, and some metadata. Each was stored separately in one plain text file, and we also created a file with the entire debate in its original order. Only points and counters are used in our experiments in Section 6. The underlying experiment settings are described in Section 4.
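The pairing of points with their counters is handled by the released Java code. As a rough illustration only, the following Python sketch collects (point, counter) pairs from per-debate text files; the theme/debate folder layout and the file names used here (point-*.txt, counter-*.txt) are purely hypothetical and do not describe the actual corpus format.

    from pathlib import Path

    def load_true_pairs(corpus_root):
        """Collect (point, counter) pairs from a crawled corpus.
        The folder structure and file names are assumptions for illustration."""
        pairs = []
        for debate_dir in sorted(p for p in Path(corpus_root).glob("*/*") if p.is_dir()):
            points = sorted(debate_dir.glob("point-*.txt"))
            counters = sorted(debate_dir.glob("counter-*.txt"))
            # Assumes point i is attacked by counter i, mirroring idebate.org's pairing.
            for point_file, counter_file in zip(points, counters):
                pairs.append((point_file.read_text(encoding="utf-8"),
                              counter_file.read_text(encoding="utf-8")))
        return pairs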

[Figure 2: Structure of idebate.org for one true argument pair in our corpus. Colors denote matching stance; we assume arguments from other debates to have no stance towards a point. Points have a conclusion and premises, counters only premises. (a)–(i) are used in Section 4 to specify the candidates in different tasks: (a) the true counter, (b) counters to points with the same stance, (c) counters to points with the opposite stance, (d) points with opposite stance, (e) other points with the same stance, (f) counters from other debates of the same theme, (g) points from other debates of the same theme, (h) counters from other themes, and (i) points from other themes.]

3.3 Corpus Statistics

Table 1 lists the number of debates crawled for each theme, along with the numbers of points and counters in the debates. The 26 points found without a counter are included in the corpus, but we do not use them in our experiments.

Theme                 Debates   Points   Counters
Culture                    46      278        278
Digital freedoms           48      341        341
Economy                    95      590        588
Education                  58      382        381
Environment                36      215        215
Free speech debate         43      274        273
Health                     57      334        333
International             196     1315       1307
Law                       116      732        730
Philosophy                 50      320        320
Politics                  155      982        978
Religion                   30      179        179
Science                    41      271        269
Society                    75      436        431
Sport                      23      130        130
Training set              644     4083       4065
Validation set            211     1290       1287
Test set                  214     1406       1401
counterargs-18           1069     6779       6753

Table 1: Distribution of debates, points, and counters over the themes in the counterargs-18 corpus. The bottom rows show the sizes of the datasets.

In total, the ArguAna Counterargs corpus consists of 1069 debates with 6753 points that have a counter. The mean length of points is 196.3 words, whereas counters span only 129.6 words, largely due to the missing explicit conclusion. To avoid exploiting this corpus bias, no approach in our experiments captures length differences.

3.4 Datasets

We split the corpus into a training set, consisting of the first 60% of all debates of each theme (ordered by alphabet), as well as a validation set and a test set, each covering 20%. The dataset sizes are found at the bottom of Table 1. By putting all arguments from a debate into a single dataset, no specific topic knowledge can be gained from the training or validation set. We include all themes in all datasets, because we expect the set of themes to be stable.

We checked for duplicates. Among the 13 532 points and counters, 3407 appear twice, 723 three times, 36 four times, and 1 five times. We ensure that no true pair is used as a false pair in our tasks.

4 Counterargument Retrieval Tasks

Based on the new corpus, we define the following eight counterargument retrieval tasks of different complexity. All tasks consider all true argument-counterargument pairs, while differing in terms of what arguments (points and/or counters) from which context (same debate, same theme, or entire portal) are candidates for a given argument.

Same Debate: Opposing Counters All counters in the same debate with stance opposite to the given argument are candidates (Figure 2: a, b). The task is to find the best counterargument among all counters to the argument's stance.

Same Debate: Counters All counters in the same debate irrespective of their stance are candidates (Figure 2: a–c). The task is to find the best counterargument among all on-topic arguments phrased as counters.
Same Debate: Opposing Arguments All arguments in the same debate with opposite stance are candidates (Figure 2: a, b, d). The task is to find the best among all on-topic counterarguments.

Same Debate: Arguments All arguments in the same debate irrespective of their stance are candidates (Figure 2: a–e). The task is to find the best counterargument among all on-topic arguments.

Same Theme: Counters All counters from the same theme are candidates (Figure 2: a–c, f). The task is to find the best counterargument among all on-theme arguments phrased as counters.

Same Theme: Arguments All arguments from the same theme are candidates (Figure 2: a–g). The task is to find the best counterargument among all on-theme arguments.

Entire Portal: Counters All counters are candidates (Figure 2: a–c, f, h). The task is to find the best counterargument among all arguments phrased as counters.

Entire Portal: Arguments All arguments are candidates (Figure 2: a–i). The task is to find the best counterargument among all arguments.

Table 2 lists the numbers of true and false pairs for each task. Experiment files containing the file paths of all candidate pairs are provided in our corpus.

Context         Candidate counterarguments   Training: true / false (ratio)        Validation: true / false (ratio)     Test: true / false (ratio)
Same debate     Opposing counters            4 065 / 11 672 (1:2.9)                1 287 / 3 590 (1:2.8)                1 401 / 4 052 (1:2.9)
Same debate     Counters                     4 065 / 27 024 (1:6.6)                1 287 / 8 348 (1:6.5)                1 401 / 9 312 (1:6.6)
Same debate     Opposing arguments           4 065 / 27 026 (1:6.6)                1 287 / 8 350 (1:6.5)                1 401 / 9 312 (1:6.6)
Same debate     Arguments                    4 065 / 54 070 (1:13.3)               1 287 / 16 700 (1:13.0)              1 401 / 18 630 (1:13.3)
Same theme      Counters                     4 065 / 1 616 000 (1:398)             1 287 / 176 266 (1:137)              1 401 / 189 870 (1:136)
Same theme      Arguments                    4 065 / 3 232 038 (1:795)             1 287 / 352 536 (1:274)              1 401 / 379 746 (1:271)
Entire portal   Counters                     4 065 / 16 517 994 (1:4063)           1 287 / 1 654 878 (1:1286)           1 401 / 1 961 182 (1:1400)
Entire portal   Arguments                    4 065 / 33 038 154 (1:8127)           1 287 / 3 309 760 (1:2572)           1 401 / 3 922 582 (1:2800)

Table 2: Number of true and false argument-counterargument pairs as well as their ratio for each evaluated context and type of candidate counterarguments in the three datasets. Each line defines one retrieval task.

5 Retrieval of the Best Counterargument without Prior Topic Knowledge

The eight defined tasks indicate the subproblems of retrieving the best counterargument to a given argument: finding all arguments that address the same topic, filtering those arguments with an opposite stance towards the topic, and identifying the best counter among these arguments. This section presents our approach to solving these problems computationally without prior knowledge of the argument's topic, based on the simultaneous similarity and dissimilarity of arguments.[2]

[2] As indicated above, counters on idebate.org (including all true counterarguments) may also differ linguistically from points (all of which are false). However, we assume this to be a specific corpus bias and hence do not explicitly account for it. Section 6 will show whether having both points and counters as candidates makes counterargument retrieval harder.

5.1 Topic as Word and Embedding Similarity

We do not reinvent the wheel to assess topical relevance, but rather follow common practice. Concretely, we hypothesize a candidate counterargument to be on-topic if it is similar to the argument in terms of its words and its embedding. We capture these two types of similarity as follows.

Word Argument Similarity To best represent the words in arguments, we did initial counterargument retrieval tests with token, stem, and lemma n-grams, n ∈ {1, 2, 3}. While the differences were not large, stems worked best and stem 1-grams sufficed. Both might be a consequence of the limited data size. In our experiments in Section 6, we determine the stem 1-grams to be considered on the training set of each task.

For word similarity computation, we tested four inverse vector-based distance measures: Cosine, Euclidean, Manhattan, and Jaccard similarity (Cha, 2007). On the validation sets, the Manhattan similarity performed best, closely followed by the Jaccard similarity. Both clearly outperformed Euclidean and especially Cosine similarity. This suggests that the presence and absence of words are equally important and that outliers should not be punished more. For brevity, we report only results for the Manhattan similarity below.

Embedding Argument Similarity We evaluated five pretrained word embedding models for representing arguments in first tests: GoogleNews-vectors (Mikolov et al., 2013), ConceptNet Numberbatch (Speer et al., 2017), wiki-news-300d-1M, wiki-news-300d-1M-subword, and crawl-300d-2M (Mikolov et al., 2017). The former two were competitive, the others performed notably worse. Since ConceptNet Numberbatch is smaller and supposed to have less bias, we used it in all experiments.

To capture argument-level embedding similarity, we compared the four inverse vector-based distance measures above on average word embeddings against the inverse Word Mover's distance, which quantifies the optimum alignment of two word embedding sequences (Kusner et al., 2015). This Word Mover's similarity consistently beat the others, so we decided to restrict our view to it.
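A minimal Python sketch of such an embedding similarity, using gensim's Word Mover's Distance over a ConceptNet Numberbatch model in word2vec format; the file name and the inverse-distance conversion are assumptions here, and gensim's wmdistance additionally requires the POT (or, in older versions, pyemd) package.

    from gensim.models import KeyedVectors

    # ConceptNet Numberbatch is distributed in word2vec text format;
    # the exact file/version used in the paper is not specified here.
    vectors = KeyedVectors.load_word2vec_format("numberbatch-en.txt.gz", binary=False)

    def word_movers_similarity(text_a, text_b):
        """Inverse Word Mover's Distance between two tokenized arguments."""
        distance = vectors.wmdistance(text_a.lower().split(), text_b.lower().split())
        return 1.0 / (1.0 + distance)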
5.2 Stance as Topic Dissimilarity

Stance classification without prior topic knowledge is challenging: While we can compare the topics of any two arguments, it is impossible in general to infer the stance of the specific aspects invoked by one argument to those of the other. As sketched in Section 2, related work employs external knowledge to infer stance relations of aspects and topics (Bar-Haim et al., 2017) or trains classifiers for attack relations (Cabrio and Villata, 2012). Unfortunately, neither applies to topics unseen before.

For argument pairs invoking similar aspects, a way to go is in principle to assess sentiment polarity; intuitively, two arguments with the same topic but opposite sentiment have opposing stance. However, we tested topic-agnostic sentiment lexicons (Baccianella et al., 2010) and state-of-the-art sentiment classifiers, trained on large-scale multiple-domain review data (Prettenhofer and Stein, 2010; Joulin et al., 2017). The correlation between sentiment and stance differences of training arguments was close to zero. A possible explanation is the limited explicitness of sentiment on idebate.org, making the lexicons and classifiers fail there.

Other related work suggests that the vocabulary of opposing sides differs (Beigman Klebanov et al., 2010). We thus checked on the training set whether counterarguments are similar in their embeddings but dissimilar in their words. The measures above did not support this hypothesis, i.e., both embedding and word similarity increased the likelihood of a candidate counterargument being the best. Still, there must be a difference between an argument and its counterargument by concept. As a solution, we capture dissimilarity with the same similarity functions as above, but we change the granularity level on which we measure similarity.

5.3 Simultaneous Similarity and Dissimilarity

The arising question is how to assess similarity and dissimilarity at the same time. We hypothesize the best counterargument to be very similar in overall terms, but very dissimilar in certain respects. To capture this intuition, we rely on expert knowledge from argumentation theory (see Section 2).

Word and Embedding Unit Similarities In particular, we follow the notion that a counterargument attacks either the conclusion of an argument, the argument's premises, or both. As a consequence, we compute two word and two embedding similarities as specified above for each candidate counterargument: once to the argument's conclusion (called wc and ec for words and embeddings, respectively) and once to the argument's premises (wp and ep).

Now, to capture similarity and dissimilarity simultaneously, we need multiple ways to aggregate conclusion and premise similarities. As we do not generally know which argument unit is attacked, we resort to four standard aggregation functions that generalize over the unit similarities. For words, these are the following word unit similarities:

    w↓ := min{wc, wp}        w↑ := max{wc, wp}        w× := wc · wp        w+ := wc + wp

Accordingly, we define four respective embedding unit similarities, e↓, e↑, e×, and e+.

As mentioned above, both word similarity and embedding similarity positively affect the likelihood that a candidate is the best counterargument. Therefore, we combine each pair of similarities as w↓ + e↓, w↑ + e↑, w× + e×, and w+ + e+, but we also evaluate their impact in isolation below.[3]

[3] In principle, other unit similarities could be used for words than for embeddings. However, we decided to couple them to maintain interpretability of our experiment results.
Counterargument Scoring Model Based on the unit similarities, we finally define a scoring model for a given pair of argument and candidate counterargument. The model includes two unit similarity values, sim and dissim, but dissim is subtracted from sim, such that it actually favors dissimilarity. Thereby, we realize the topic and stance similarity sketched in Figure 1. We weight the two values with a damping factor α:

    α · sim − (1 − α) · dissim

where sim, dissim ∈ {w↓ + e↓, w↑ + e↑, w× + e×, w+ + e+} and sim ≠ dissim.

The general idea of the scoring model is that sim rewards one type of similarity, whereas subtracting dissim punishes another type. We seek to thereby find the most dissimilar candidate among the similar candidates. The model is meant to give a higher score to a pair the more likely the candidate is the best counterargument to the argument, so the scores can be used for ranking.

Which combination of sim and dissim turns out best is hard to foresee and may depend on the retrieval task at hand. We hence evaluate different combinations empirically below. The same holds for the damping factor α ∈ [0, 1]. If our hypothesis on similarity and dissimilarity is true, then the best α should be close to but lower than 1. Conversely, if α = 1.0 achieves the best performance, then only similarity would be captured by our model.
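A hedged Python sketch of the scoring model and the resulting ranking; the candidate data structure (a dict holding the combined unit similarities from the sketch above) is an illustrative assumption, not the authors' Java implementation.

    def score(units, alpha, sim_key, dissim_key):
        """alpha * sim - (1 - alpha) * dissim over two different combined unit similarities."""
        return alpha * units[sim_key] - (1.0 - alpha) * units[dissim_key]

    def rank_candidates(candidates, alpha, sim_key, dissim_key):
        """Rank candidate counterarguments by descending score; the top-ranked one is retrieved.
        Each candidate is assumed to be a dict with a 'units' entry as built above."""
        return sorted(candidates,
                      key=lambda candidate: score(candidate["units"], alpha, sim_key, dissim_key),
                      reverse=True)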
6 Evaluation

We now report on systematic ranking experiments with our counterargument scoring model. The goal is to evaluate on all eight retrieval tasks from Section 4 to what extent our hypothesis holds that the best counterargument to an argument invokes the same aspects while having opposing stance. The Java source code of the experiments is available at http://www.arguana.com/software.

6.1 Experimental Set-up

We evaluated the following set-up of tasks, data, measures, baselines, and approaches.

Tasks We tackled each of the eight retrieval tasks as a ranking problem, i.e., we aimed to rank the best counterargument to each argument highest, given all candidates. Accordingly, only one candidate counterargument per argument is correct.[4]

[4] One alternative would be to see each argument pair as one instance of a classification problem. However, our preliminary tests confirmed the intuition that identifying the best counterargument is hard without knowing the other candidates, i.e., there is no general (dis)similarity threshold that makes an argument the best counterargument. Rather, how similar or dissimilar a counterargument needs to be depends on the topic and on the other candidates. Another alternative would be to treat all candidates for an argument as one instance, but this makes the experimental set-up very intricate.

Data Table 2 has shown the true and false argument pairs in all datasets. We undersampled each training set, resulting in 4065 true and 4065 false training pairs in all tasks.[5] Our model does not do any learning-to-rank on these pairs, but we derived lexicons for the word similarities from them (all stems included in at least 1% of all pairs). As detailed below, we then determined the best model configurations on the validation sets and evaluated these configurations on the test sets.

[5] Undersampling was stratified, such that the same number of false counterarguments was taken from each type, b–i, in Figure 2 that is relevant in the respective task.

Measures As only one candidate is true per argument, we report the accuracy@1 of each approach, i.e., the percentage of arguments for which the true counterargument was ranked highest. Besides, we compute the rounded mean rank of the best counterargument in all rankings, reflecting the average performance of an approach. Exemplarily, we also mention the mean reciprocal rank (MRR), which is more sensitive to outliers.

Baselines A trivial way to address the given tasks is to pick any candidate by chance for each argument. This random baseline allows quantifying the impact of other approaches. As counterargument retrieval has not been tackled yet, we do not use any existing baseline.[6] Instead, we evaluate the effects of the different building blocks of our scoring model. On one hand, we check the need for distinguishing conclusions and premises by comparing to the word argument similarity (w) and the embedding argument similarity (e). On the other hand, we consider all eight word and embedding unit similarities (w↓, w↑, . . . , e+) as baselines, in order to see whether and how best to aggregate them.

[6] Notice, though, that we tested a number of approaches to identify opposing stance, as discussed in Section 5.

Approaches After initial tests, we reduced the set of tested values of the damping factor α in our scoring model to {0.8, 0.9, 1.0}. On the validation sets of the first six tasks,[7] we then analyzed all possible combinations of w↓ + e↓, w↑ + e↑, w× + e×, w+ + e+, as well as w + e for sim and dissim. Three configurations of the model turned out best:

    we := 1.0 · (w× + e×)
    we↓ := 0.9 · (w× + e×) − 0.1 · (w↓ + e↓)
    we↑ := 0.9 · (w+ + e+) − 0.1 · (w↑ + e↑)

[7] We did not expect "game-changing" validation set results for the last two tasks and so left them out for time reasons.
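For illustration, the three selected configurations expressed in Python over the combined unit similarities from the earlier sketch; the keys 'min', 'max', 'product', and 'sum' are this sketch's names for w↓ + e↓, w↑ + e↑, w× + e×, and w+ + e+.

    def we(units):
        # we := 1.0 * (w-product + e-product)
        return units["product"]

    def we_down(units):
        # we(down) := 0.9 * (w-product + e-product) - 0.1 * (w-min + e-min)
        return 0.9 * units["product"] - 0.1 * units["min"]

    def we_up(units):
        # we(up) := 0.9 * (w-sum + e-sum) - 0.1 * (w-max + e-max)
        return 0.9 * units["sum"] - 0.1 * units["max"]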

we was best on the validation set of Same Debate: Opposing Arguments (accuracy@1: 62.1) and we↓ on that of Same Debate: Arguments (49.0). All other tasks were dominated by we↑. Especially, we↑ was better than 1.0 · (w+ + e+) in all of them, with clear leads of up to 2.2 points. This underlines the importance of modeling dissimilarity for counterargument retrieval. We took we, we↓, and we↑ as our approaches for the test set.[8]

[8] All validation set results are found in the supplementary material, which we provide at http://www.arguana.com/publications

6.2 Results

Table 3 shows the accuracy@1 and the mean rank of all baselines and approaches on each of the eight given retrieval tasks.

                                       Same Debate                                            Same Theme              Entire Portal
                                       Opp. Ctr.'s  Counters    Opposing    Arguments        Counters    Arguments   Counters    Arguments
    Baseline / Approach                @1    R      @1    R     @1    R     @1    R          @1    R     @1    R     @1    R     @1    R
w   Word argument similarity           65.9  2      48.5  2     42.5  3     30.0  4          44.1  5     28.3  10    39.7  22    21.8  49
e   Embedding argument similarity      62.9  2      44.6  2     51.6  2     36.8  4          38.8  7     32.9  10    34.2  39    23.9  55
w↓  Word unit similarity minimum       53.8  2      38.4  3     45.9  3     33.7  5          28.5  22    24.8  42    21.4  206   18.5  403
w↑  Word unit similarity maximum       66.1  2      48.0  2     44.0  3     30.2  4          44.0  5     28.3  9     38.0  21    21.2  44
w×  Word unit similarity product       64.9  2      49.5  3     56.1  2     40.7  4          44.3  18    36.8  35    37.8  177   26.8  354
w+  Word unit similarity sum           71.5  1      53.7  2     54.1  2     39.1  4          49.0  4     36.8  7     44.7  17    28.6  33
e↓  Embedding unit sim. minimum        61.6  2      44.9  3     43.4  3     32.1  4          37.8  7     27.4  13    32.5  42    20.7  74
e↑  Embedding unit sim. maximum        63.4  2      44.5  2     47.5  2     33.2  4          39.8  5     29.8  8     32.1  20    20.1  33
e×  Embedding unit sim. product        69.7  1      52.0  2     55.4  2     41.0  3          44.3  4     37.1  6     43.2  14    27.8  21
e+  Embedding unit sim. sum            69.7  1      51.8  2     55.4  2     40.5  3          47.5  4     36.8  6     43.0  13    27.6  21
we  1.0·(w×+e×)                        72.1  1      55.2  2     ‡60.3 2     †44.9 3          50.4  4     40.9  7     46.0  19    32.2  34
we↓ 0.9·(w×+e×) − 0.1·(w↓+e↓)          72.0  1      55.5  2     59.5  2     44.1  3          51.3  4     †41.0 7     46.3  19    31.7  35
we↑ 0.9·(w++e+) − 0.1·(w↑+e↑)          †74.5 1      †57.7 2     59.6  2     44.1  3          ‡54.2 3     40.8  5     ‡50.0 9     ‡32.4 15
r   Random baseline                    25.7  2      13.1  4     13.1  4     7.0   7          0.7   69    0.4   137   0.1   701   0.0   1401

Table 3: Test set accuracy of ranking the best counterargument highest (@1) and mean rank (R) for 14 baselines and approaches (w, e, w↓, . . . , r) in all eight tasks (given by context and candidates). Each task's best accuracy value significantly outperforms the best baseline with 99% (†) or 99.9% (‡) confidence.

Overall, the counter-only tasks seem slightly harder within the same debate (comparing Counters to Opposing), i.e., stance is harder to assess than topical relevance. Conversely, the other Counters tasks seem easier, suggesting that topically close but false candidate counterarguments with the same stance as the argument (which are not included in any Counters task) are classified wrongly most often. Besides, these results support that potential differences in the phrasing of counters are not exploited, as desired.

The accuracy of the standard similarity measures, w and e, goes from 65.9 and 62.9, respectively, in the smallest task down to 21.8 and 23.9 in the largest. w is stronger when only counters are candidates, e otherwise. This implies that words capture differences between the best and other counters, whereas embeddings rather help discard false candidates with the same stance as the argument.

From the eight unit similarity baselines, w+ performs best on five tasks (e× twice, w× once). w+ finds 71.5% true counterarguments among all opposing counters in a debate, and 28.6% among all test arguments from the entire portal. In that task, however, the mean ranks of w+ (33) and particularly of w× (354) are much worse than for e× (21), meaning that words are insufficient to robustly find counterarguments.

we, we↓, and we↑ outperform all baselines in all tasks, improving the accuracy by 8.1 (Same Theme: Arguments) to 10.3 points (Entire Portal: Counters) over w and e, and by at least 3.0 over the best baseline in each task. Among all opposing arguments from the same debate (true-to-false ratio 1:6.6), we finds 60.3% of the best counterarguments, and 44.9% when all arguments are given (1:13.3).

The winner in our evaluation is we↑, though, being best in five of the eight tasks. It found the true counterargument among all opposing counters in 74.5% of all cases, and about every third time (32.4%) among all 2801 test set arguments; a setting where the random baseline has virtually no chance. Given all arguments from the same theme, we↑ puts the best counterargument at a mean rank of 5 (MRR 0.58), and for the entire portal still at 15 (MRR 0.5). Although our scoring model thus does not solve the retrieval tasks, we conclude that it serves as a robust approach to rank the best counterargument high.
To test significance, we separately computed the accuracy@1 for the arguments from each theme. The differences between the 15 values of the best approach on each task and those of the best baseline (w+, w×, or e×) were normally distributed. Since the baselines and approaches are dependent, we used a one-tailed dependent t-test with paired samples. As Table 3 specifies, our approaches are consistently better, partly with at least 99% confidence, partly even with 99.9% confidence.
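A minimal sketch of such a test with SciPy; using ttest_rel with alternative='greater' (SciPy ≥ 1.6) is one way to obtain the one-tailed p-value and is an assumption about tooling rather than a description of the authors' actual setup.

    from scipy import stats

    def one_tailed_paired_ttest(approach_per_theme, baseline_per_theme):
        """One-tailed dependent t-test over paired per-theme accuracy@1 values."""
        result = stats.ttest_rel(approach_per_theme, baseline_per_theme, alternative="greater")
        return result.statistic, result.pvalue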
In Table 4, we exemplarily detail the comparison of the best approach (we↑) to the best baseline (w+) on Entire Portal: Arguments.

Theme                 Arguments    Accuracy@1 (w+ / we↑)    Mean rank (w+ / we↑)
Culture                      69        31.9 / 36.2               12 / 9
Digital freedoms             61        37.7 / 44.3               58 / 20
Economy                     125        27.2 / 25.6               21 / 10
Education                    81        38.3 / 39.5               36 / 17
Environment                  46        17.4 / 21.7               22 / 7
Free speech debate           58        10.3 / 12.1              130 / 55
Health                       77        28.6 / 36.4               26 / 14
International               271        25.8 / 31.4               31 / 19
Law                         134        38.8 / 38.1               16 / 8
Philosophy                   85        34.1 / 38.8               29 / 14
Politics                    202        28.7 / 33.2               28 / 11
Religion                     45        24.4 / 33.3               58 / 8
Science                      57        19.3 / 28.1                6 / 5
Society                      60        16.7 / 20.0               45 / 22
Sport                        30        43.3 / 46.7               35 / 9
All themes                 1401        28.6 / 32.4               33 / 15

Table 4: Accuracy@1 and mean rank of the best baseline (w+) and approach (we↑) on each theme when all 2801 test set arguments are candidates.

The mean ranks across themes underline the robustness of we↑, being in the top 10 for 7 themes and in the top 20 even for 13 themes. Still, the accuracy@1 of both w+ and we↑ varies notably, in the case of we↑ from 12.1 for free speech debate to 46.7 for sport. For free speech debates (e.g., "This House would criminalise blasphemy"), we observed that their arguments tend to be disproportionately long, which might lead to deviating similarities. In the case of sports, the topical specificity (e.g., "This House would ban boxing") reduces the probability of mistakenly choosing candidates from other themes.

Free speech debate turned out the hardest theme in seven tasks, health in the remaining one. Besides sports, in some tasks the best results were obtained for religion and science, both of which share the characteristic of dealing with very specific topics.[9]

[9] The individual results of the best approach and baseline on each theme are also found in the supplementary material.

7 Conclusion

This paper has asked how to find the best counterargument to any argument without prior knowledge of the argument's topic. We did not aim to engineer the best approach to this retrieval task, but to study whether we can model the simultaneous similarity and dissimilarity of a counterargument to an argument computationally. For the restricted domain of debate portal arguments, our main result is quite intriguing: The best model (we↑) rewards a high overall similarity to the argument's conclusion and premises while punishing a too high similarity to either of them. Despite its simplicity, we↑ found the best counterargument among 2801 candidates in almost a third of all cases, and ranked it into the top 15 on average. This speaks for our hypothesis that the best counterargument often just addresses the same topical aspects with opposite stance.

Of course, our hypothesis is simplifying, i.e., there are counterarguments that will not be found based on aspect and stance similarity only. Apart from some hyperparameters, however, our model is unsupervised, and it does not make any assumption about an argument's topic. Hence, it applies to any argument, given a pool of candidate counterarguments. While the model can be considered open-topic, a next step will be to study counterargument retrieval open-source.

We are confident that the modeled intuition generalizes beyond idebate.org. To obtain further insights into the nature of counterarguments, deeper linguistic analysis along with supervised learning may be needed, though. We provide a corpus to train respective approaches, but leave the according research to future work.

The intended practical application of our model is to retrieve counterarguments in automatic debating technologies (Rinott et al., 2015) and argument search (Wachsmuth et al., 2017b). While debate portal arguments are often suitable in this regard, in general a real counterargument does not always exist for a given argument. Still, returning one that addresses similar aspects with opposite stance makes sense then. An alternative would be to generate counterarguments, but we believe that humans are better than machines at writing them — currently.

References

Yamen Ajjour, Wei-Fan Chen, Johannes Kiesel, Henning Wachsmuth, and Benno Stein. 2017. Unit segmentation of argumentative texts. In Proceedings of the 4th Workshop on Argument Mining, pages 118–128. Association for Computational Linguistics.

Khalid Al-Khatib, Henning Wachsmuth, Matthias Hagen, Jonas Köhler, and Benno Stein. 2016. Cross-domain mining of argumentative text through distant supervision. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1395–1404. Association for Computational Linguistics.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA).

Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261. Association for Computational Linguistics.

Beata Beigman Klebanov, Eyal Beigman, and Daniel Diermeier. 2010. Vocabulary choice as an indicator of perspective. In Proceedings of the ACL 2010 Conference Short Papers, pages 253–257. Association for Computational Linguistics.

Filip Boltužić and Jan Šnajder. 2014. Back up your stance: Recognizing arguments in online discussions. In Proceedings of the First Workshop on Argumentation Mining, pages 49–58. Association for Computational Linguistics.

Liora Braunstain, Oren Kurland, David Carmel, Idan Szpektor, and Anna Shtok. 2016. Supporting human answers for advice-seeking questions in CQA sites. In Proceedings of the 38th European Conference on IR Research, pages 129–141.

Elena Cabrio and Serena Villata. 2012. Combining textual entailment and argumentation theory for supporting online debates interactions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 208–212. Association for Computational Linguistics.

Sung-Hyuk Cha. 2007. Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4):300–307.

Oana Cocarascu and Francesca Toni. 2017. Identifying attack and support argumentative relations using deep learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1374–1379. Association for Computational Linguistics.

T. Edward Damer. 2009. Attacking Faulty Reasoning: A Practical Guide to Fallacy-Free Arguments, 6th edition. Wadsworth, Cengage Learning, Belmont, CA.

Vanessa Wei Feng and Graeme Hirst. 2011. Classifying arguments by scheme. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 987–996. Association for Computational Linguistics.

Trudy Govier. 2010. A Practical Study of Argument, 7th edition. Wadsworth, Cengage Learning, Belmont, CA.

Ivan Habernal and Iryna Gurevych. 2016. What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in web argumentation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1214–1223. Association for Computational Linguistics.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. Before name-calling: Dynamics and triggers of ad hominem fallacies in web argumentation. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, to appear.

Yufang Hou and Charles Jochim. 2017. Argument relation classification using a joint inference model. In Proceedings of the 4th Workshop on Argument Mining, pages 60–66. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pages 957–966.

John Lawrence and Chris Reed. 2017. Mining argumentative structure from natural language text using automatically generated premise-conclusion topic models. In Proceedings of the 4th Workshop on Argument Mining, pages 39–48. Association for Computational Linguistics.

Stefano Menini, Federico Nanni, Simone Paolo Ponzetto, and Sara Tonelli. 2017. Topic-based agreement and disagreement in US electoral manifestos. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2938–2944. Association for Computational Linguistics.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2017. Advances in pre-training distributed word representations. CoRR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, pages 3111–3119.

Vlad Niculae, Joonsuk Park, and Claire Cardie. 2017. Argument mining with structured SVMs and RNNs. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 985–995. Association for Computational Linguistics.

Andreas Peldszus and Manfred Stede. 2015. Towards detecting counter-considerations in text. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 104–109. Association for Computational Linguistics.

Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1118–1127. Association for Computational Linguistics.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, M. Mitesh Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence — an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 440–450. Association for Computational Linguistics.

Haggai Roitman, Shay Hummel, Ella Rabinovich, Benjamin Sznajder, Noam Slonim, and Ehud Aharoni. 2016. On the retrieval of Wikipedia articles containing claims on controversial topics. In Proceedings of the 25th International Conference on World Wide Web, Companion Volume, pages 991–996.

Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 67–77. Association for Computational Linguistics.

Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4444–4451.

Christian Stab and Iryna Gurevych. 2016. Recognizing the absence of opposing arguments in persuasive essays. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 113–118. Association for Computational Linguistics.

Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International World Wide Web Conference, pages 613–624.

Stephen E. Toulmin. 1958. The Uses of Argument. Cambridge University Press.

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. 2017a. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 176–187. Association for Computational Linguistics.

Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017b. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, pages 49–59. Association for Computational Linguistics.

Henning Wachsmuth, Benno Stein, and Yamen Ajjour. 2017c. "PageRank" for argument relevance. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1117–1127. Association for Computational Linguistics.

Douglas Walton. 2006. Fundamentals of Critical Argumentation. Cambridge University Press.

Douglas Walton. 2009. Objections, rebuttals and refutations. Pages 1–10.

Justine Zhang, Ravi Kumar, Sujith Ravi, and Cristian Danescu-Niculescu-Mizil. 2016. Conversational flow in Oxford-style debates. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 136–141. Association for Computational Linguistics.