SZZ Algorithms
1. Introduction
may be a complex task, as access to data sources, the use of specific tools, and the availability of detailed documentation have to be handled. Detecting the factors that hinder reproducibility should help strengthen the credibility of empirical studies [6].
This paper addresses how the scientific practice of the ESE research community affects the reproducibility and credibility of its results. In particular, we want to address studies that use techniques or algorithms based on assumptions and heuristics. This is the case of the SZZ algorithm, published in 2005 in "When do changes induce fixes?" by Śliwerski, Zimmermann and Zeller [7] at the MSR workshop4. SZZ has been widely used in academia, counting, as of February 2018, more than 590 citations in Google Scholar5.
In this paper, we present a Systematic Literature Review (SLR) on the use and reproducibility of the SZZ algorithm in 187 academic publications. Although this paper offers a case study of a widely used research technique, it goes beyond credibility and reproducibility to offer a wider picture of the lack of high-quality scientific practices. The credibility of the results presented in ESE research is negatively affected by this situation.
We make five significant contributions by presenting:
1. An overview of the impact that SZZ has had so far in ESE.
3. An overview of how studies that use the SZZ algorithm address reproducibility in their research work.
4 MSR is today a working conference, but at that time it was a co-located workshop with
ICSE in its second edition.
5 https://scholar.google.es/scholar?cites=3875838236578562833
The remainder of this paper is structured as follows. First, we present a detailed description of the SZZ algorithm in Section 2. Next, we describe the method used in the SLR, with the research questions and the inclusion/exclusion criteria, in Section 3. Section 4 describes how we extracted the data, followed by the results from systematically analyzing 187 publications in Section 5. Related work is then presented in Section 6. We then discuss the implications and offer lessons learned in Section 7. The threats to the validity of our study can be found in Section 8. Finally, we draw conclusions and point out future research in Section 9.
6 Publications included in the Systematic Literature Review (SLR) are cited in this paper using the following format: [SLR#ref]
Figure 1: Example of changes committed to a file; the first change is the bug-introducing change and the third change is the bug-fixing change.
Figure 2: First and second parts of the SZZ algorithm.
Table 1: Limitations when using SZZ.

First part
• Incomplete mapping [SLR[14]]: The fixing commit cannot be linked to the bug report.
• Inaccurate mapping [SLR[15]]: The fixing commit has been linked to a wrong bug report; they do not correspond to each other.
• Systematic bias [SLR[14]]: A fixing commit is linked to no real bug report.

Second part
• Cosmetic changes, comments, blank lines [SLR[16]]: Variable renaming, indentation, split lines, etc.
• Added lines in fixing commits [SLR[8]]: The new lines cannot be tracked back.
• Long fixing commits [SLR[8]]: The larger the fix, the more false positives.
• Weak semantic level [SLR[11]]: Changes with the same behavior are being blamed.
• Changes correct at the time of being committed [SLR[8]]: Changes in other parts of the source code base trigger a bug issue in another part.
• Commit squashing [17]: Might hide the bug-introducing commit, losing authorship information.
It may also be that the buggy line was not analyzed by SZZ, producing a false negative. In some cases, the bug had been introduced before the last change to the line; then, the history of the line has to be traced back until the true source of the bug is found [SLR[11]]. An example of this can be found when SZZ flags changes to style (i.e., non-semantic/syntactic changes such as changes to white space, indentation, comments, and changes that split or merge lines of code) as bug-introducing changes [SLR[8]], or when a project allows commit squashing, since this option removes authorship information, resulting in more false positives. It may also happen that the bug was caused by a change in another part of the system [12]. A final possibility is that the bug fix modified the surrounding context rather than the problematic lines, misleading the algorithm [SLR[13]].
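To make this concrete, the following is a minimal sketch (in Python, driving git) of the heuristic behind the second part of SZZ: blame, in the parent revision of a fixing commit, the lines that the fix removed or modified, and treat the blamed commits as candidate bug-introducing changes. It is an illustration only, not the original implementation, and it deliberately omits the refinements discussed above (filtering cosmetic changes, tracing further back in the history, or checking dates against the bug report), which is precisely where the limitations in Table 1 appear. The repository path, commit hash and file name in the usage comment are hypothetical.

# Minimal sketch of the second part of SZZ: blame the lines deleted by a fixing
# commit in its parent revision to obtain candidate bug-introducing commits.
import re
import subprocess


def git(repo, *args):
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout


def deleted_lines(repo, fix_sha, path):
    """Line numbers (in the parent revision) removed or modified by the fix."""
    diff = git(repo, "diff", "-U0", f"{fix_sha}^", fix_sha, "--", path)
    lines = []
    for m in re.finditer(r"^@@ -(\d+)(?:,(\d+))? \+", diff, re.MULTILINE):
        start, count = int(m.group(1)), int(m.group(2) or "1")
        lines.extend(range(start, start + count))
    return lines


def bug_introducing_candidates(repo, fix_sha, path):
    """Blame each deleted line in the parent revision of the fixing commit."""
    candidates = set()
    for line in deleted_lines(repo, fix_sha, path):
        blame = git(repo, "blame", "--porcelain",
                    "-L", f"{line},{line}", f"{fix_sha}^", "--", path)
        candidates.add(blame.split()[0])  # first token is the blamed commit hash
    return candidates


# Usage (hypothetical repository, fixing commit and file):
# print(bug_introducing_candidates("/tmp/project", "a1b2c3d", "src/module.c"))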
Da Costa et al. created a framework to eliminate unlikely bug-introducing changes from the outcome of SZZ. Their framework is based on a set of requirements that consider the dates of the suspicious commit and of the bug report [SLR[8]]. By removing those commits that do not fulfill these requirements, the number of false positives produced by SZZ is lowered significantly.
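A minimal sketch of the kind of date-based filtering such a framework applies: a candidate bug-introducing commit cannot be newer than the report of the bug it supposedly introduced, so such candidates are discarded as unlikely. The data structures and field names below are illustrative, not the framework's actual API.

# Discard candidate bug-introducing commits authored after the bug was reported.
from datetime import datetime


def plausible_candidates(candidates, bug_report_date):
    return [c for c in candidates if c["authored"] <= bug_report_date]


candidates = [
    {"sha": "c0ffee1", "authored": datetime(2015, 3, 2)},
    {"sha": "c0ffee2", "authored": datetime(2016, 1, 20)},  # newer than the report
]
print(plausible_candidates(candidates, datetime(2015, 12, 1)))  # only c0ffee1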
The misclassification problem has been further investigated by researchers aiming at mitigating the limitations found in the first part of SZZ [SLR[9]], [SLR[18]], [SLR[10]]. Tools and algorithms have been created, based on the information from version control systems and issue tracking systems, to map bug reports to fixing commits [SLR[19]], [SLR[20]], [21], [22]. For instance, GitHub encourages linkage by automatically closing issues if the commit message contains #number. Several authors have suggested semantic heuristics [SLR[23]], [24], [SLR[25]]; others have proposed solutions that rely on feature extraction from bugs and issue tracking system metadata [SLR[19]], [SLR[20]], [SLR[21]]. In addition, many Free/Open Source Software projects have adopted as a good practice the use of "# fix-bug —" keywords in their commit comments when a bug is fixed, as has been reported for the Apache HTTP web server7 in [SLR[26]], and for VTK8 and ITK9 in [27].
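As an illustration of the syntactic linking heuristics mentioned above (including GitHub's "#number" convention), the sketch below scans commit messages for a fix keyword and an issue identifier. It only shows the basic idea behind the first part of SZZ-style analyses; approaches such as ReLink or FRLink add semantic features and issue-tracker metadata. The example messages are made up.

# Minimal syntactic heuristic for linking fixing commits to bug reports.
import re

FIX_KEYWORDS = re.compile(r"\b(fix(es|ed)?|close(s|d)?|resolve(s|d)?|bug)\b", re.I)
ISSUE_ID = re.compile(r"#(\d+)|\b(?:bug|issue)[ :]*(\d+)", re.I)


def linked_issues(commit_message):
    """Return the issue ids a commit message appears to fix (empty set if none)."""
    if not FIX_KEYWORDS.search(commit_message):
        return set()
    return {next(g for g in m.groups() if g) for m in ISSUE_ID.finditer(commit_message)}


print(linked_issues("Fix NPE in parser, closes #1234"))  # {'1234'}
print(linked_issues("Refactor build scripts"))           # set()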
As a result of these efforts, the accuracy of the first part of SZZ has increased significantly. For example, FRLink, a state-of-the-art bug linking approach, has improved the performance of missing link recovery compared to existing approaches, outperforming the previous one by 40.75% (in F-measure) when achieving the highest recall [22].
Related to the second part of SZZ, two main improvements have been proposed in the literature. We call them SZZ-1 and SZZ-2 in this paper:
7 https://httpd.apache.org/
8 http://www.vtk.org/
9 https://itk.org/
annotate10 [SLR[16]].
The second part of the algorithm, however, still has room for further improvement. Addressing its limitations often requires a manual, tedious validation process.
3. Research Method
The purpose of an SLR is to identify, evaluate and interpret all available studies relevant to a particular topic, research question, or effect of interest [28]. An SLR provides an overall view of the evidence on a particular topic across a wide range of previous studies and empirical methods. As a result, an SLR should offer evidence with consistent results and suggest areas for further investigation. We follow the SLR approach proposed by Kitchenham and Charters [28].
The aim of this SLR is to study and analyze the credibility and reproducibility of the SZZ algorithm. Therefore, we address the following research questions (RQs):
RQ1: What is the impact of the SZZ algorithm in academia?
We want to investigate the impact of SZZ in academia. We have divided this RQ into several subquestions:
RQ1.1: How many publications use the complete SZZ algorithm?
Motivation. The SZZ algorithm has been shown to be a key technique for locating the changes that induced later fixes. Many papers, however, use only the first part of the algorithm, to link bug-fix reports to commits. As the second part of SZZ is the one that has been shown to have significant threats, we identify those publications that use both parts, or at least the second part, of the SZZ algorithm. In addition, we offer other metrics on the publications, such as the number of authors and the geographic diversity of the institutions they work for, in order to provide insight into how widespread the use of SZZ is.
RQ1.2: How has the impact of SZZ changed over time?
Motivation. Our goal is to visualize the impact of SZZ over time, to see whether SZZ was only adopted in the first years after its publication or whether it is still widely used nowadays.
RQ1.3: What are the most common venues with publications using the SZZ algorithm?
Motivation. Our goal is to address the maturity and diversity of the publications where SZZ has been used, in order to understand its audience. We address the maturity of a publication by analyzing whether it has been accepted at a workshop, a conference, a journal, or a top journal. Diversity is given by the number of distinct venues where publications using SZZ can be found.
RQ2: Are studies that use SZZ reproducible?
RQ3: Do the publications mention the limitations of SZZ?
Motivation. The improved versions of the original SZZ algorithm address some of its limitations. We analyze whether any of the improvements to the SZZ algorithm can be found in the primary studies included in the SLR. Thus, we search for any mention of their use, be it in the description of the method or in the threats to validity. Answering this research question allows us to further understand how authors who use SZZ behave given its limitations.
RQ5: Does the reproducibility of the studies improve when au-
thors (1) report limitations or (2) use improved versions of SZZ?
3.2. Inclusion Criteria
After enumerating the research questions, we present the inclusion and exclusion criteria for the SLR. In addition, we describe the search strategy used for primary studies, the search sources, and the reasons for removing papers from the list.
The inclusion criteria address all published studies written in English that cite either:
1. the original SZZ publication, "When do changes induce fixes?" [7], or
2. (at least) one of the two publications with improved versions of the algorithm, "Automatic Identification of Bug-Introducing Changes" [SLR[16]] and "SZZ Revisited: Verifying When Changes Induce Fixes" [SLR[11]].
There was no need to further investigate the references of the resulting set of publications (a process known as snowballing): if one of these papers also contained a reference to a paper that fits the inclusion criteria, we assume that paper is already in our sample.
Before accepting a paper into the SLR, we excluded publications that are duplicates, i.e., a more mature version (usually a journal publication) of a less mature version (conference, workshop, PhD thesis...). In those cases, we only considered the more mature version. When we found a short and a long version of the same publication, we chose the longer version. However, in those cases where the publication is a PhD thesis and a related (peer-reviewed) publication exists in a workshop, conference or journal, we discarded the thesis in favor of the latter, because conference and journal publications are peer-reviewed and PhD theses are not. Documents that are a false alarm (i.e., not a real, scientific publication) have also been excluded.
The studies were identified using Google Scholar and Semantic Scholar as of November 8th, 2016. We have looked exclusively in Google Scholar and Semantic Scholar; as Table 2 shows, these two databases report the largest numbers of citations to the SZZ publications.
Table 2: Number of citations of the SZZ, SZZ-1 and SZZ-2 publications by research databases.

        | Google Scholar | Semantic Scholar | ACM Digital Library | CiteSeerX
# SZZ   | 493            | 295              | 166                 | 26
# SZZ-1 | 141            | 100              | 60                  | 18
# SZZ-2 | 26             | 15               | 8                   | 0
3.3.1. Study Selection Criteria and Procedures for Including and Excluding Primary Studies
Table 3 shows that our searches elicited 1,070 citation entries. After applying the inclusion criteria described above, we obtained a list of 458 papers. This process was performed by the first author. The process is objective, as it involves discarding false alarms, duplicates, and papers not written in English.
Then, the first author analyzed the remaining 458 papers, looking for the use of SZZ, SZZ-1 and SZZ-2 in the studies.
Table 3: Number of papers that have cited the SZZ, SZZ-1 and SZZ-2 publications, joining the research databases Google Scholar and Semantic Scholar, during each stage of the selection process.

Selection process                            | #SZZ        | #SZZ-1      | #SZZ-2
Papers extracted from the databases          | 788         | 241         | 41
Sift based on false alarms                   | 29 removed  | 10 removed  | 2 removed
Sift based on not available / not in English | 40 removed  | 4 removed   | 0 removed
Sift based on duplicates                     | 308 removed | 187 removed | 32 removed
Full papers considered for review            | 411         | 40          | 7
Removed after reading                        | 149 removed | 32 removed  | 4 removed
Papers accepted to the review                | 262         | 8           | 3
This resulted in 193 papers being removed for three main reasons: i) they only cited the algorithm as part of the introduction or related work but never used it, ii) they only cited the algorithm to support a claim in their results or discussion, and iii) the papers were essays, systematic literature reviews or surveys. This process was discussed in advance by all the authors. The second author partially validated the process by analyzing a random subset comprising 10% of the papers. The agreement between both authors, measured using Cohen's Kappa coefficient, resulted in a value of 1 (perfect agreement). These papers were removed on the basis that they do not answer our research questions. After this process, 273 papers were included in this SLR.
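The agreement check described above (two raters independently labeling the same random 10% sample of papers) can be illustrated with a small sketch that computes Cohen's Kappa, i.e., raw agreement corrected for chance. The labels below are made up; they are not the actual SLR data.

# Cohen's Kappa for two raters labeling the same items (e.g. "uses SZZ" or not).
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


a = ["uses", "uses", "no", "uses", "no", "no", "uses", "no", "uses", "uses"]
b = ["uses", "uses", "no", "no",   "no", "no", "uses", "no", "uses", "uses"]
print(round(cohen_kappa(a, b), 2))  # 0.8 for this made-up sample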
This section explains how we obtained the data to show an overall picture
of credibility and reproducibility of the SZZ algorithm in ESE studies.
4.1.1. Phase 1: Establishing that the study uses the complete SZZ algorithm
In this SLR we only consider studies that use the complete algorithm, or at least its second part. Even though limitations have been reported in both parts of the SZZ algorithm (see Section 2), most of the limitations present in the first part have been successfully addressed in recent years.
To analyze the ease of reproducibility of each study, based on our experience in conducting similar research [3, 1], we looked for (1) a replication package provided by the authors or (2) a detailed description. A detailed description must include: (a) the precise dates when the data were retrieved from the projects under analysis, (b) the versions of the software and systems used in the analysis, (c) a detailed description of the methods used in each phase of the study, and (d) an enumeration of the research tools used. It should be noted that we did not inspect whether the replication package is still available, or whether the elements in the package make the study reproducible. We assume that the authors and the reviewers checked the availability of the replication package at the time of submission. We do not claim the availability of these packages in the long term; for instance, the replication package from the original SZZ paper [7] is no longer available.
Applying these criteria to the set of 273 papers, we obtained 187 papers that fulfill them.
We have read and analyzed the 187 papers, and extracted the following data in a first reviewing phase to answer RQ1:
1. Title,
2. Authors,
6. Venue and class of publication (journal, conference, workshop or university
thesis).
2. For a detailed description of the methods and data used (as in [3]), to answer RQ2.
4. Whether a manual inspection to verify the results has been done, to answer RQ3.
The first author of this paper extracted the data in the two phases. Then, the second author randomly validated 10% of these results to ensure that the articles included in the SLR were suitable. The agreement, using Cohen's Kappa coefficient, was 0.73 (good agreement). This coefficient was computed on whether a paper uses the SZZ algorithm or not, and on which part is being used13.
Cruzes and Dybå reported that synthesizing findings across studies is especially difficult, and that some SLRs in software engineering do not offer this synthesis [30]. For this SLR we have extracted and analyzed both quantitative and qualitative data from the studies, but we have not synthesized the studies, as they are too diverse.
13 The discordance was primarily because in some papers the description of the methodology is spread over the paper, and some parts may have been overlooked by one of us. This occurs more often in those papers that do not explicitly mention the algorithm's acronym (SZZ) or support the description with a citation.
Doing a meta-analysis would offer limited and unstructured insight [31], and the results would suffer from some of the limitations of SLRs published in other disciplines [32]. Thus, we combined our quantitative and qualitative data to generate an overview of how authors have addressed the reproducibility and credibility of the studies. The results are presented in Section 5.
In addition, we have constructed a quality measure14 that assesses the ease of reproducibility of a study. This measure is based on the scores of five characteristics of the papers that we looked for in the second reviewing phase. If a question was answered positively, the paper was given a positive score for it, otherwise 0:
4. Does the study provide a detailed description of the methods and data used? (score = 1 point)
5. Results
This section presents the results of our SLR. All the details, at the publication level and for each of the RQs, can be found in the on-line replication package.
14 The main goal of this quality measure is to determine the reproducibility and credibility of the studies at the moment the study was submitted.
Table 4: Mapping of the overall score to the quality measure on ease of reproducibility of a study.

Score | Quality measure
0–1   | Poor to be reproducible and to have credible results
2–4   | Fair to be reproducible and to have credible results
5–6   | Good to be reproducible and to have credible results
7     | Excellent to be reproducible and to have credible results
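As a rough illustration of how this quality measure works, the sketch below (in Python) sums per-question points for a paper and maps the total to the categories of Table 4. Only one of the five questions is visible in the text above (the detailed description of methods and data, worth 1 point), so the remaining criteria and their weights are placeholders, chosen only so that the maximum score is 7.

# Map a paper's reproducibility/credibility score to the categories of Table 4.
def quality_category(score):
    if score <= 1:
        return "Poor"
    if score <= 4:
        return "Fair"
    if score <= 6:
        return "Good"
    return "Excellent"


def score_paper(criteria):
    """criteria: mapping of question -> (answered_positively, points)."""
    return sum(points for answered, points in criteria.values() if answered)


# Hypothetical paper that only provides a detailed description of methods and data;
# all question names and weights except that one are placeholders.
paper = {
    "replication package available": (False, 2),
    "detailed description of methods and data": (True, 1),
    "manual validation of results": (False, 1),
    "limitations of SZZ reported": (False, 1),
    "improved SZZ version used": (False, 2),
}
print(score_paper(paper), quality_category(score_paper(paper)))  # 1 Poor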
Table 6: Results of calculating the quality measure of reproducibility and credibility.

Quality measure                                            | # papers
Poor to be reproducible and to have credible results       | 34 (18%)
Fair to be reproducible and to have credible results       | 126 (67%)
Good to be reproducible and to have credible results       | 24 (13%)
Excellent to be reproducible and to have credible results  | 3 (2%)
Papers offering new tools (NT) and replications (R) offer slightly better results than the rest.
Poor 15 (25%) 10 (19%) 6 (15%) 3 (9%) 14 (20%) 2 (13%) 17 (22%) 1 (7%) 0 (0%) 0 (0%)
Fair 38 (64%) 32 (62%) 29 (72%) 27 (77%) 49 (70%) 13 (81%) 50 (65%) 8 (53%) 3 (60%) 3 (75%)
Good 6 (10%) 9 (17%) 5 (13%) 4 (11%) 5 (7%) 1 (10%) 10 (13%) 5 (33%) 2 (40%) 1 (25%)
Excellent 1 (1%) 1 (2%) 0 (0%) 1 (3%) 2 (3%) 0 (0%) 0 (0%) 1 (7%) 0 (0%) 0 (0%)
15 We argue that master and PhD theses should be categorized as long publications. We have chosen 50 pages as the limit between medium and long papers because, in our data, all master theses have more than 50 pages whereas none of the journal articles does.
Table 8: Results of measuring the ease of reproducibility and credibility of the studies depending on the type of paper and its size.

Quality | Venue: Journal, Conference, Workshop, University | Size: Short, Medium, Long
5.1.1. RQ1.1: How many publications use the complete SZZ algorithm?
Our final sample included 187 papers. These publications have been authored by more than 370 different authors from institutions located in 24 different countries, offering evidence that SZZ is a widespread and well-known algorithm.
5.1.2. RQ1.2: How has the impact of SZZ changed over time?
Figure 3 shows the evolution of the number of publications that have cited and used SZZ, SZZ-1 or SZZ-2 up to November 2016. The SZZ algorithm was published in 2005 and has since been cited by 178 of these studies; SZZ-1 was published in 2006 and has 53 citations; finally, SZZ-2 was published in 2008 and counts 16 publications16.
The number of studies per year peaked in 2013, with 30 papers using a SZZ version. In general, since 2012 the number of studies using this algorithm seems to have stabilized at over 15 citations/year for the use of the complete algorithm.
5.1.3. RQ1.3: What are the most common venues with publications using the SZZ algorithm?
Table 9 shows the different types of venues with publications where SZZ has been used. We have classified the venues into four categories: university theses, workshop papers, conference and symposium publications, and journal articles.
16 Note that a paper can cite more than one version of SZZ.
Figure 3: (RQ1.2) Sum of the number of publications using the (complete) SZZ, SZZ-1 or
SZZ-2 by year of publication (N=187).
Master theses, student research competitions and technical reports have been grouped under university theses. Diversity and maturity can be found in the sample, as can be seen from the number of different venues (second column in Table 9) and the considerable number of journal publications (third column in Table 9).
Table 10 offers further insight into those venues that have published more studies that use SZZ. The most frequent one is the conference where SZZ itself was presented, the Working Conference on Mining Software Repositories (MSR).
Table 9: (RQ1.3) Most frequent types of publications using (the complete) SZZ (N=187). "# different" counts the distinct venues; "# publications" counts the total number of publications in that type of venue.

Type                     | # different | # publications
Journals                 | 21          | 42
Conferences & Symposiums | 40          | 102
Workshops                | 13          | 13
University theses        | 20          | 30
Table 10: (RQ1.3) Most popular venues with publications using SZZ, SZZ-1 and SZZ-2 (N=187). "J" stands for journal and "C" for conference/symposium.

Type | Name                                        | Rating            | # papers
C    | Conf Mining Softw Repositories (MSR)        | Class 2 - CORE A  | 15 (8%)
C    | Intl Conf Software Eng (ICSE)               | Class 1 - CORE A* | 12 (6%)
C    | Intl Conf Soft Maintenance (ICSME)          | Class 2 - CORE A  | 10 (5%)
J    | Empirical Software Eng (EmSE)               | JCR Q1            | 9 (5%)
J    | Transactions on Software Eng (TSE)          | JCR Q1            | 9 (5%)
C    | Intl Symp Emp Soft Eng & Measurement (ESEM) | Class 2 - CORE A  | 8 (4%)
C    | Intl Conf Automated Softw Eng (ASE)         | Class 2 - CORE A  | 7 (4%)
C    | Symp Foundations of Software Eng (FSE)      | Class 1 - CORE A* | 6 (3%)
Two top conferences, the International Conference on Software Engineering (ICSE) and the International Conference on Software Maintenance and Evolution (ICSME), are second and third. SZZ can also frequently be found in high quality journals, such as Empirical Software Engineering (EmSE) and Transactions on Software Engineering (TSE). The quality rating of conferences given in Table 10 has been obtained from the GII-GRIN-SCIE (GGS) Conference Rating17; Class 1 (CORE A*) conferences are considered excellent, top notch events (the top 2% of all events), while Class 2 (CORE A) are very good events (the next top 5%). For journals, we offer the quartile as given by the well-known Journal Citation Reports (JCR) by Clarivate Analytics (previously Thomson Reuters).
RQ1: The impact of the SZZ algorithm is significant: 458 publications cite SZZ, SZZ-1 or SZZ-2; 187 of these use the complete algorithm. The popularity and use of SZZ has risen quickly since its publication in 2005, and it can be found in all types of venues (high diversity), ranging from top journals to workshops and PhD theses; SZZ-related publications have often been published in high quality conferences and top journals (high maturity).
17 http://gii-grin-scie-rating.scie.es/
Table 11: (RQ2) Publications by their reproducibility. Rows: Yes is the number of papers that fulfill each column; No is its complement. Columns: Package means a replication package is offered, Environment means a detailed methodology and dataset are provided; Both is the intersection of Package and Environment. (N=187)

    | Package Only | Environment Only | Both | None
Yes | 19           | 72               | 24   | 72
No  | 168          | 96               | 163  | 115
Table 12: (RQ3) Number of publications that mention limitations of SZZ in their Threats To Validity (TTV). Mentions can be to the first part (TTV-1st), the second part (TTV-2nd) or both parts (Complete-TTV). The absence of mentions is classified as No-TTV. Note that Complete-TTV is the intersection of TTV-1st and TTV-2nd.

    | No-TTV | TTV-1st only | TTV-2nd only | Complete-TTV
Yes | 94     | 44           | 10           | 39
No  | 93     | 143          | 177          | 148
We have classified the publications into four groups, depending on how they address the limitations of SZZ as a threat to validity (TTV). Thus, we have publications that i) mention limitations of the complete algorithm (Complete-TTV), ii) mention only limitations in the first part (TTV-1st), iii) mention only limitations in the second part (TTV-2nd), and iv) do not mention limitations at all (No-TTV).
Table 12 offers the results of our analysis. Of the 187 publications, only 39 mention limitations of the complete SZZ as a threat to validity, whereas 83 refer to limitations in the first part and 49 to limitations in the second part. The rest, 94 studies, do not mention any limitation.
In a deeper review, we found 82 publications where a manual inspection had been done to assess these limitations: 33 of them referred to issues related to the first part of the SZZ algorithm, while 30 analyzed aspects of the second part (i.e., the bug-introducing changes). In the remaining 19 papers, the manual validation of results did not focus on the outputs of the SZZ algorithm.
RQ3 (a): Almost half (49.7%) of the analyzed publications mention limitations in the first or second part of SZZ as a threat to validity. Limitations of the first part are reported more often than those of the second part.
5.4. RQ4: Are the improvements to SZZ (SZZ-1 and SZZ-2) used?
Our analysis tries to find out how often they have been used.
It is difficult to determine which improvement has been used when the authors do not mention it in the publication. Thus, if the authors do not explicitly specify having used an improvement, we assume that they used the original version of SZZ. We have classified the publications into one of the following groups, depending on the kind of improvement they use:
• Original SZZ: Those only citing the original version and not mentioning improvements.
• SZZ-1: Those citing the improved version of Kim et al. [SLR[16]].
• SZZ-2: Those citing the improved version of Williams and Spacco [SLR[11]].
• SZZ-mod: Those citing the original SZZ with some (own) modification by the authors. Publications in this group contain statements like "we adapt SZZ", "the approach is similar to SZZ" or "the approach is based on SZZ", but do not refer explicitly to SZZ-1 or SZZ-2.
Table 13: (RQ4) Number of papers that have used the original SZZ, the improved versions of SZZ, or some adaptation to mitigate the threats.

               | Original SZZ only | SZZ-improved only (a) | SZZ-mod only | Mixed
# publications | 71 (38%)          | 26 (14%)              | 75 (40%)     | 15 (8%)

(a) 22 (12%) of the papers use SZZ-1 and only 4 (2%) use SZZ-2.
5.5. RQ5: Does the reproducibility of the studies improve when authors (1) report limitations or (2) use improved versions of SZZ?
Finally, we want to see whether we can find a relationship between some of the characteristics studied so far; in particular, between i) reproducibility and the reporting of limitations, ii) reproducibility and the use of an improved version of SZZ, and iii) the version of SZZ used and the reporting of limitations of SZZ. It should be noted that we measure the association between these variables, i.e., causation cannot be inferred from our results.
Hypothesis formulation
• H1₀: There is no association between the reproducibility of the papers and reporting limitations of SZZ.
• H2₀: There is no association between the reproducibility of the papers and the version of SZZ used.
• H3₀: There is no association between the version of SZZ used and reporting limitations of SZZ.
For all the above null hypotheses, the alternative hypothesis is its negation.
Variables
Reproducibility is a categorical variable that can take three values:
1. NoReproduction, for publications that provide neither a replication package nor a detailed description of the methodology and data.
2. PartialReproduction, for publications that provide only one of the two.
3. FullReproduction, for publications that provide both (replication package and detailed methodology and data).
Version of SZZ used in the studies is a categorical variable that can take three different values:
1. SZZ-Original, for publications that use the original version of SZZ.
2. SZZ-Mod, for publications that use the original SZZ with some own modification(s).
3. SZZ-Improved, for publications that use one of the improved versions (SZZ-1 or SZZ-2).
Design
The data we have collected can be aggregated and jointly displayed in tabular form to find associations and interactions between variables. Tables 14, 15 and 16 show the cross tabulations between variables.
To test hypotheses H1₀, H2₀ and H3₀, we use a Chi-square test, which can be applied when there are two categorical variables, each with two or more possible values. If we cannot reject a null hypothesis, we cannot claim an association between the variables; when we can reject it, we conclude that there is an association between the variables. As is customary, the tests are performed at the 5% significance level. Furthermore, we should consider the multiple testing problem: when multiple hypotheses are tested, the chance of a rare event increases and, as a consequence, the probability of incorrectly rejecting a null hypothesis increases as well [33]. Thus, we apply the Bonferroni correction, and the significance level becomes 0.05/3 = 0.017 (see Table 17).
Table 14: Cross tabulation between the categorical variables Reproducibility and Reporting for hypothesis H1₀.

      | Full Repro | Partial Repro | No Repro | Total
R+    | 10         | 18            | 11       | 39
R-    | 9          | 45            | 40       | 94
Total | 19         | 63            | 51       | 133
Table 15: Cross tabulation between the categorical variables Reproducibility and version of SZZ for hypothesis H2₀.

             | Full Repro | Partial Repro | No Repro | Total
SZZ-Original | 13         | 39            | 19       | 71
SZZ-Improved | 2          | 10            | 13       | 25
SZZ-Mod      | 7          | 38            | 30       | 75
Total        | 22         | 87            | 62       | 171
Table 16: Cross tabulation between the categorical variables Reporting and version of SZZ for hypothesis H3₀.

      | SZZ-Original | SZZ-Improved | SZZ-Mod | Total
R+    | 16           | 10           | 9       | 35
R-    | 32           | 9            | 47      | 88
Total | 48           | 19           | 56      | 123
Table 17: (RQ5) P-values for each of the hypotheses. Note that after the Bonferroni correction, the significance level is 0.017 (0.05/3).

        | H1₀    | H2₀    | H3₀
p-value | 0.0905 | 0.1162 | 0.0059
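For readers who want to replay this procedure, the following is a minimal sketch (in Python, using SciPy) of a chi-square test of independence on a cross tabulation, with the Bonferroni-corrected significance level of 0.05/3 and the standardized Pearson residuals discussed below. The contingency table is made up for illustration; it is not the data of Tables 14-16, and its p-value is not meant to reproduce Table 17.

# Chi-square test of independence on a 2x3 cross tabulation, Bonferroni-corrected
# significance level, and standardized Pearson residuals. Illustrative data only.
import numpy as np
from scipy.stats import chi2_contingency

alpha = 0.05 / 3  # Bonferroni correction: three hypotheses are tested

# Rows: reports limitations (R+) / does not (R-); columns: Full/Partial/No repro.
table = np.array([[12, 20, 10],
                  [10, 40, 38]])

chi2, p, dof, expected = chi2_contingency(table)
verdict = "reject" if p < alpha else "cannot reject"
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}: {verdict} H0 at alpha={alpha:.3f}")

# Standardized Pearson residuals; cells with |residual| > 2 drive a significant result.
n = table.sum()
row = table.sum(axis=1, keepdims=True) / n
col = table.sum(axis=0, keepdims=True) / n
residuals = (table - expected) / np.sqrt(expected * (1 - row) * (1 - col))
print(np.round(residuals, 2))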
The p-values obtained for H1₀ and H2₀ (Table 17) are above the corrected significance level of 0.017, so we cannot reject these two null hypotheses. In other words, we have not found an association between reproducibility and reporting limitations (H1₀), nor between reproducibility and the version of SZZ used (H2₀). So, contrary to our expectations, papers that follow a better approach in one aspect do not necessarily follow a better approach in the other.
On the other hand, we can reject the null hypothesis H3₀. We expected to find an association between the version of SZZ used and the reporting of limitations. However, Table 16 shows a trend in the opposite direction to the one we assumed: in particular, proportionally few papers that use ad-hoc modifications on top of SZZ (SZZ-mod) report limitations. This finding may seem paradoxical at first, since to enhance an algorithm one first has to be aware of its problems; but what we have found is that those who create their own enhanced versions are less prone to report limitations, i.e., they may overestimate the reach of their solution (and, thus, do not report its limitations).
When rejecting the null hypothesis, we also need to understand how these variables are related. This is done by observing the standardized Pearson residuals, which measure the difference between the observed and expected frequencies. When the standardized residual for a category is greater than 2 (in absolute value), we can conclude that it is a major contributor to the significant Chi-square statistic. None of the 21 residual values meets that condition18.
RQ5: Reporting limitations of SZZ and the use of improved versions of SZZ are not associated with a higher reproducibility of the papers. The use of an improved version of SZZ is associated with reporting limitations, although not in the direction we initially expected; in particular, authors who perform ad-hoc improvements report limitations less frequently.
18 The tables with all residuals can be found in the replication package.
6. Related Work
data types and techniques used, in addition to an evaluation of approaches, opportunities and challenges [SLR[35]]. Amann et al. review studies published in the top venues for Software Engineering research (such as ICSE, ESEC/FSE and MSR) with the purpose of describing the current state of the art, the mined artifacts, the pursued goals and the reproducibility of the studies. Their findings show, in line with the ones presented in this paper, that only 40% of the studies are potentially replicable [SLR[36]].
Bug Reports. Zhang et al. review the work on bug report analysis and present an exhaustive survey. They give some guidance to cope with possible problems and point out the need for further work on bug report analysis, because none of the existing automatic approaches has achieved a satisfactory accuracy [SLR[37]]. Other authors, such as Bachmann and Bernstein, have performed a survey of five open source software projects and one closed source software project in order to understand the quality and characteristics of the data gathered from issue tracking databases. This study shows that some projects present bad links between bug reports and commits, and that the process of bug fixing could be designed more efficiently [SLR[38]].
Defect Prediction. Jureczko and Madeyski carried out a survey on process metrics in defect prediction [SLR[39]]. The authors discuss some of those metrics, such as the number of revisions, the number of distinct committers, and the number of modified lines, among others, and also present a taxonomy of the analyzed process metrics, showing that process metrics can be an effective addition to software defect prediction models. Nam has performed a survey on software defect prediction, focusing on the discussion of various approaches, applications and other emerging topics, and concluding with the identification of some challenging issues in defect prediction [SLR[40]]. Hall et al. performed an SLR to understand how variables such as the context of the models, the independent variables used and the modeling techniques applied influence fault prediction models. Their results indicate that models based on simple modeling techniques such as Naïve Bayes or Logistic Regression perform well [SLR[41]].
7. Discussion
In this paper we have studied the use of SZZ, a widely used algorithm in ESE research. We have shown that SZZ is certainly relevant, not limited to a niche audience, and can be found in publications in top journals and prominent conferences. In this regard, we can see its study as a case study of how a software engineering practice spreads across academia.
We have observed that the limitations of SZZ are well known and documented. Improvements have been proposed, with unequal success up to now. While the limitations of its first part –related to linking fixing commits to bug tracking issues– have been addressed to a large extent, the enhancements for the second part –which has to do with finding the bug-introducing change– are still limited, and accuracy has room for improvement.
Even if limitations have been widely documented, our study shows that this has not made ESE practices stronger. From the detailed study of the threats to validity of publications using SZZ, SZZ-1 and SZZ-2, we have seen that most publications do not report the limitations and, interestingly enough, limitations of the first part –which have been shown to be less relevant– are discussed more often than those of the second part. The fact that 38% of the publications use the original SZZ is indicative in this regard.
We have found that the reproducibility of the publications is limited, and replication packages are seldom offered. The results presented in our research are in line with previous research [SLR[36]], although not as bad as the ones found for MSR in 2010 [3]. In any case, we think they are not satisfactory, and should prompt some reflection about the scientific methods used in our discipline.
Even if using one of the improved versions of the algorithm helps the accuracy of the SZZ approach, as pointed out in [SLR[42]] ("Accuracy in identifying bug introducing changes may be increased by using advanced algorithms (Kim et al. 2006, 2008)"), they are seldom used: only 14% of the publications use one of the two (improved) revisions of SZZ. It seems that researchers prefer to reinvent the wheel (40% use an ad-hoc modified version of SZZ in their publications) rather than to use others' improvements. One possible reason for this is that the papers that describe the SZZ algorithm or any of its improvements do not provide a software implementation; thus, researchers have to implement it from scratch for their investigation. Another possible reason is a lack of awareness of SZZ-1 and SZZ-2. Our results show that in such a situation what researchers do is to take the base SZZ algorithm and then add some modifications, resulting in an ad-hoc solution. For the ad-hoc solutions identified, we have not found a rationale for why existing enhancements to SZZ were not used. Another major problem with the improvements to SZZ is that they have not been given a version or label. Even if a revision of SZZ is used, publications often refer to it simply as SZZ, making it difficult to follow, to reproduce, to replicate, and to raise awareness of this issue.
We have observed that ease of reproducibility is rarely found in the studied publications; we could classify only 15% of the papers as being of good or excellent quality with respect to reproducibility. The research community should direct more attention to these aspects; we believe that too much attention is paid to the final result(s) (the product of the research: new knowledge) in comparison to the research process. As researchers in the field of software engineering, we know that both –a high quality product and a high quality process– are essential for successful advancement in the long term [43].
All these factors undermine the credibility of ESE research, and require profound consideration by the research community.
A description of the elements, methods and software used during the study is also valuable.
On the other hand, to provide more trustworthy results, we recommend that researchers specify (and argue for) the use of those methods/algorithms that mitigate the limitations of their studies, be aware of the risk of every assumption being used and, if needed, provide a manual analysis of the results. For those studies where the study size is large, researchers can select a random sample and validate it manually.
As is the case in software projects with release numbering [44], it would be desirable to have a similar mechanism for the software implementations used in this type of research, although this may not always be possible given the decentralized nature of research. We therefore recommend that researchers who develop modifications to the SZZ algorithm publish their software implementation on development sites such as GitHub, so other researchers can fork the project. These forks can be easily traced, and the authors can ask for a specific citation to their solution if other researchers make use of it.
Finally, we offer a simple way to measure the ease of reproducibility and credibility of research papers. Even if this measure has been conceived with studies that make use of SZZ in mind, we think it can easily be adapted to other ESE research. Thus, authors can easily assess whether their paper offers reproducible and trustworthy work (i.e., with a score of 5 or above).
We should not forget the responsibility of reviewers in the scientific process. We have seen that authors may often be short-term focused when presenting their research. We have found that reproducibility is not associated with reporting limitations or with the use of improved versions of SZZ. Reviewers should have the vision required to evaluate studies with a long-term perspective that is beneficial for the scientific community, helping authors to raise the level of their research. Thus, we recommend that reviewers ask themselves the questions proposed in Section 4.3, adapted to the context of the research, when reviewing publications that are based on heuristics and assumptions.
8. Threats to validity
Wohlin et al. discuss four main types of validity threats in ESE research:
conclusion, internal, construct and external [45].
Conclusion validity, being related to how sure we can be that the treatment we used in an investigation is related to the actual outcome we observed, does not affect our approach.
Internal validity is the extent to which a causal conclusion based on a study is warranted, which is determined by the degree to which a study minimizes systematic errors. We have attempted to minimize this threat by following the procedures for performing SLRs described in [46], and we offer a replication package so that third parties can inspect our sources. However, there might be a selection bias due to having chosen Google Scholar and Semantic Scholar as the source of all publications on SZZ; other publications may exist that are not indexed by them. In addition, other publications could make use of SZZ without citing the original publication or its improvements. Another factor affecting internal validity may be the maturity of the field, in the sense that our study may have been done at too early a stage to draw conclusions. We think, however, that more than 10 years is enough time to allow valid lessons to be extracted from its study.
Construct validity is the degree to which an investigation measures what it claims to be measuring. In this paper, we measure impact by the number of publications and the types and diversity of venues, and reproducibility by the availability of a replication package or of a detailed description and data set; we have manually analyzed hundreds of papers for specific, sometimes very detailed aspects, so human errors may have occurred. We think that the effect on impact is small, given the large number of publications and venues that we have found. The effect on reproducibility is to be considered, as we have not reproduced or replicated the studies ourselves; in this regard, we offer an upper bound on the number of publications that may be reproducible. Our replication package does not remove the human errors that we may have incurred, but it offers the possibility
to others to check and improve our work.
External validity is the degree to which results can be generalized to other contexts. In this paper, we have selected a single case study, with its particularities and peculiarities. We cannot claim that our results can be generalized to ESE research. However, the value of case studies should not be undermined; Flyvbjerg provides several examples of individual cases that contributed to discoveries in physics, economics, and the social sciences [47], while Beveridge observed for the social sciences: "More discoveries have arisen from intense observation than from statistics applied to large groups" (as quoted in [48], page 95).
how it should be addressed by the ESE community. Additional case studies should be carried out on other empirical techniques to ascertain whether our findings hold in other scenarios and contexts, for instance, when confidential data are involved.
12. References
Following the guidelines described in [SLR[41]], we have cited publications using the [SLR nn] format if they were one of the (458) full papers considered for review in the SLR; otherwise they have been cited using the normal format [#ref]. The complete list of publications of the SLR can be found in the replication package. There, each reference is followed by a code indicating the status of the paper: whether it passed (P) or failed (F) our criteria to be included in the SLR. In case the paper failed our assessment, we also indicate in which phase (1 or 2).
[3] G. Robles, Replicating MSR: A study of the potential replicability of papers published in the Mining Software Repositories proceedings, in: Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on, IEEE, 2010, pp. 171–180.
[9] K. Herzig, S. Just, A. Zeller, It's not a bug, it's a feature: how misclassification impacts bug prediction, in: Proceedings of the 2013 International Conference on Software Engineering, IEEE Press, 2013, pp. 392–401, (P).
[11] C. Williams, J. Spacco, SZZ revisited: verifying when changes induce fixes, in: Proceedings of the 2008 Workshop on Defects in Large Software Systems, ACM, 2008, pp. 32–36, (P).
[12] D. M. German, A. E. Hassan, G. Robles, Change impact graphs: Determining the impact of prior code changes, Information and Software Technology 51 (10) (2009) 1394–1408.
[17] G. Gousios, The GHTorrent dataset and tool suite, in: Proceedings of the 10th Working Conference on Mining Software Repositories, IEEE Press, 2013, pp. 233–236.
[18] M. Tan, L. Tan, S. Dara, C. Mayeux, Online defect prediction for imbalanced data, in: Proceedings of the 37th International Conference on Software Engineering - Volume 2, IEEE Press, 2015, pp. 99–108, (P).
[19] R. Wu, H. Zhang, S. Kim, S.-C. Cheung, ReLink: recovering links between bugs and changes, in: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ACM, 2011, pp. 15–25, (F,1).
[20] A. T. Nguyen, T. T. Nguyen, H. A. Nguyen, T. N. Nguyen, Multi-layered approach for recovering links between bug reports and fixes, in: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ACM, 2012, p. 63, (F,1).
[28] B. Kitchenham, S. Charters, Guidelines for performing systematic literature reviews in software engineering (2007).
[35] W. Jung, E. Lee, C. Wu, A survey on mining software repositories, IEICE Transactions on Information and Systems 95 (5) (2012) 1384–1406, (F,1).
[38] A. Bachmann, A. Bernstein, Software process data quality and characteristics: a historical view on open and closed source projects, in: Proceedings of the Joint International and Annual ERCIM Workshops on Principles of Software Evolution (IWPSE) and Software Evolution (Evol) Workshops, ACM, 2009, pp. 119–128, (F,1).