Supplementary Material: Citation-Context Analysis of John et al. (2012)
Citing a document is an act of symbol usage. Cited documents symbolize ideas, methods, or broader claims to those who cite them,
and over time, as citations accrue, cited documents may become “concept symbols”. Citers tend to give earlier works consensual
meaning by “piling up” identical, similar or synonymous words and phrases in the sentences in which their citing markers are
embedded. Hence, a “uniformity of usage” may evolve over time, from early, more diverse and elaborate citing, to later, less
elaborate, but more general and uniform citing. Developments towards “uniformity of usage” are obviously influenced by citers being
“inspired” by what previous citers wrote. Eventually, a document may come to symbolize concepts that its authors did not
anticipate. It may be a method, a claim, or a general idea or topic. However, we also see that some documents have “split identities”,
so that they represent different “concept symbols” in different domains of citers.
The John study from 2012 is highly cited. It attained that status rather quickly and it is therefore likely that during this period, it has
also become a concept symbol to many of its citers. Becoming a concept symbol takes time. Initial citing will be more diverse and
evidently be based on abstract or full text reading of the cited document. As citing documents are published and citing contexts
become visible to future citers, the process of “uniformity of usage” may evolve. It is therefore relevant to examine early citing
contexts in order to understand the symbol usage here, and then compare it to the latest citing contexts in order to examine the extent
to which the cited document can be seen as a concept symbol. At the same time, it is also interesting to examine the meaning of the
potential concept symbol and relate this to the writing and claims in the source document.
In the manuscript, we have already outlined the content of the John study. Here we simply reiterate that the main claim seems to be
anchored in the following two sentences from the abstract (see Table S1.1): “the study found that the percentage of respondents who
engaged in questionable practices was surprisingly high” and that “this finding suggests that some questionable practices may
constitute the prevailing research norm”. We claim that the authors signal “widespread use of QRPs among their respondents”,
implicitly taken to be psychologists. But the study is clearly more subtle, examining ten practices with varying prevalence using a
sophisticated method. It is therefore interesting to map the potential diversity of symbol usage in the early citing contexts and
juxtapose it with usage in the latest contexts, where a new factor has emerged, namely the criticism from Fiedler and Schwarz (2016)
questioning the main claims referred to above.
Table S1.1: Abstract from John et al. (2012)
Abstract
Cases of clear scientific misconduct have received significant media attention recently, but less flagrantly
questionable research practices may be more prevalent and, ultimately, more damaging to the academic
enterprise. Using an anonymous elicitation format supplemented by incentives for honest reporting, we surveyed
over 2,000 psychologists about their involvement in questionable research practices. The impact of truth-telling
incentives on self-admissions of questionable research practices was positive, and this impact was greater for
practices that respondents judged to be less defensible. Combining three different estimation methods, we found
that the percentage of respondents who have engaged in questionable practices was surprisingly high. This
finding suggests that some questionable practices may constitute the prevailing research norm.
Method
We analyze the citing contexts using a simple categorization scheme (Table S1.2). The categorization process has been bottom-up, so
that categories have been established and adjusted along the way. We use eight categories. We divide six of the categories into three
levels. Level 1 is the most specific: the citing context engages directly with the John survey (SS), with an individual
practice (PC), or with some other specific feature of the paper not related to the prevalence findings (O). Level 2 is for “broader claims”
(BC). These turned out to be citing contexts that addressed consequences of (specific) practices leading to false-positive claims, or
increased chances for significant findings or publication. Notice that it is characteristic of this category that it predominantly addresses
consequences, not prevalence. This category is also characterized by frequent co-citing with other papers and therefore by the use of
near-synonymous terms, especially when it comes to “practices”, which are used interchangeably with “researcher-degrees-of-
freedom” or “analytical choices”. Among the 2012-13 citing documents, the John study is co-cited with Simmons, Nelson and
Simonsohn (2011) in 32 out of 42 documents. Many of these citing contexts appear in this “broader claims” category, which is not
surprising, as the Simmons study is strongly related to the John study and coined the term “researcher degrees of freedom”. Level 3
contains “general claims”, or what may be considered over time to be “concept symbols”. We have listed two such general claims: the
first one (GC1) is directly related to the main claim coming out of the John study, which basically states that questionable research
practices are widespread. This is a general claim about prevalence of QRPs. The question is to what extent the citing context reflects
upon the study population of the John study, consisting of American psychological researchers, or to what extent it simply generalizes
the prevalence claim beyond psychology. The second general claim is even broader (GC2). Here prevalence is not addressed, and the
John study simply becomes a symbol for the concept of “questionable research practices”, even though the authors did not coin this
concept. Finally, we have two other categories: one that contains citing contexts that seem to be some cognitive distance away from
the John study (D), and a category that is only relevant for the 2020 contexts, as it addresses the co-citing of the John and Fiedler
studies, where the latter is critical of the former’s main claims and therefore seems to be highly relevant when GC1 is invoked.
Table S1.2: Concept symbol coding scheme
Level | Abbreviation | Code | Meaning
Level 3 | GC1 | General claim 1 | “Widespread use of questionable research practices”
Level 3 | GC2 | General claim 2 | “Questionable research practices”
Level 2 | BC | Broader claims | “Consequences of the use of practices (researcher degrees of freedom; analytical choices) that lead to false positives, or increase the chance of significance and publication”
Level 1 | PC | Specific practice claim | “Specific focus on an individual practice”
Level 1 | SS | Survey-specific claims | “Specific citing of the survey and its results”
Level 1 | O | Other | Citing contexts that do not address the findings in John, e.g. survey methodology
No level | D | Distant | Citing contexts that seem to be some cognitive distance away from the John study
No level | CRIT | Critical | Only relevant for the 2020 citing contexts, as it addresses the co-citing of the John and Fiedler studies, where the latter is critical of the former’s main claims
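For bookkeeping during coding, the scheme above can also be expressed as a small lookup table. The following is a minimal sketch (hypothetical Python; the names and the tally helper are ours, not part of the original analysis), assuming one code is assigned per citing context:

```python
# Hypothetical sketch: the coding scheme of Table S1.2 as a lookup table,
# plus a helper that tallies assigned codes by code and by level.
from collections import Counter

CODING_SCHEME = {
    # code: (level, short description)
    "GC1":  ("Level 3", "Widespread use of questionable research practices"),
    "GC2":  ("Level 3", "Questionable research practices"),
    "BC":   ("Level 2", "Consequences of practices that inflate false positives"),
    "PC":   ("Level 1", "Specific focus on an individual practice"),
    "SS":   ("Level 1", "Specific citing of the survey and its results"),
    "O":    ("Level 1", "Other, e.g. survey methodology"),
    "D":    ("No level", "Cognitively distant from the John study"),
    "CRIT": ("No level", "Co-citing of the John and Fiedler studies"),
}

def tally(codes):
    """Count coded citing contexts by individual code and by level."""
    by_code = Counter(codes)
    by_level = Counter(CODING_SCHEME[code][0] for code in codes)
    return by_code, by_level

# Example with made-up codes for five citing contexts:
# by_code, by_level = tally(["GC1", "BC", "SS", "GC2", "GC1"])
```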
We examine the distribution of codes across citing contexts between two time frames, 2012-13 and 2020. We include all citing
documents from 2012-13 (40 documents, 67 contexts) and a random sample from 2020 (21 out of 87 documents, 30 contexts). We are interested in
examining potential shifts among the category levels, where we would expect citing contexts to become more general over time if the
John study is becoming a concept symbol to many of those who cite it. The results are presented in Table S1.3 below.
Table S1.3: Results

Level | Code | 2012-13 (n) | 2012-13 (share) | 2020 (n) | 2020 (share)
Level 3 | GC2 | 3 | 0.04 | 4 | 0.13
Level 3 | GC1 | 11 | 0.16 | 10 | 0.33
Total | | 67 | | 30 |
Results
The results do suggest that over time the John study is becoming a concept symbol. One third of the citing contexts in 2020 refer to the
“widespread use of questionable research practices”, and slightly more than one in ten refer simply to “questionable research
practices”. Many contexts do not specifically refer to psychology, but most citing documents are published in psychological journals
so this restriction may be implicit. Combined, Level 3 constitutes almost half of all citing contexts in 2020, whereas that fraction was
one in five in 2012-13. The shift demonstrates that early on, citing contexts were more diverse and spread out between the three levels. What
seems to have happened is that Level 2 has shrunk to almost nothing, presumably because this “broader claims” category has seen a gradual
“uniformity of usage” moving towards an even more general symbolic use of the John study. When comparing co-citing with the
Simmons study in 2020, the terms used for the John study are clearly referring to prevalence of QRPs or simply QRPs. What becomes
clear is that the John and Simmons studies go hand-in-hand as they are co-cited very frequently (32 out of 42 in 2012-13 and 427 out
of 691 in total). This suggests that these papers symbolize a single broader narrative.
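The fractions discussed above follow directly from the counts in Table S1.3. As a minimal arithmetic check (hypothetical illustration code, using only the counts reported in the table):

```python
# Hypothetical arithmetic check of the shares reported in Table S1.3.
counts = {
    "2012-13": {"GC2": 3, "GC1": 11, "total": 67},
    "2020":    {"GC2": 4, "GC1": 10, "total": 30},
}

for period, c in counts.items():
    level3 = c["GC1"] + c["GC2"]
    print(f"{period}: GC2 {c['GC2'] / c['total']:.2f}, "
          f"GC1 {c['GC1'] / c['total']:.2f}, "
          f"Level 3 combined {level3 / c['total']:.2f}")

# 2012-13: GC2 0.04, GC1 0.16, Level 3 combined 0.21  (about one in five)
# 2020:    GC2 0.13, GC1 0.33, Level 3 combined 0.47  (almost half)
```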
Another interesting pattern is that the frequency of Level 1 citing contexts seems to be rather consistent over time. This means that
while cited documents may become concept symbols, and as such are typically cited in the introduction or review sections of a paper,
they are also still cited for specific reasons, which indicates a direct engagement with parts of the study. Finally, in the 2020 citing
contexts, four out of thirty co-cite the John and Fiedler studies (CRIT). Interestingly, ten citing contexts (GC1) in 2020 perpetuate the
main claim from the John study that “the use of QRPs is widespread” without co-citing the Fiedler study that seriously disputes this
claim. Compared to the four critical contexts, this seems to be a signal that the main claim of the John study has become a meme that
“lives a life of its own” due to biased citing practices. It seems evident that the main claim of “widespread use of QRPs” cannot stand
alone. If it does, the claim is flawed and the citing practice is distorted. It is also interesting that the “general claims” we find in the
citing contexts, “widespread use of QRPs” (GC1) and “questionable research practices” (GC2), both appear in the abstract of the John
paper.
Table S2.1: Citing contexts: 2012-13
Pub ID   Cit ID   Citation Contexts   Claim type
percent admitted to having only report studies that “worked” (which we take to imply p < .05), whereas
57% acknowledged to having used sequential testing (cf. Wagenmakers, 2007) in their work.
1 2 Strategy 2. Perform one large study and use some of the QRPs most popular in psychology (John et al., GC1
2012). These QRPs may be performed sequentially until a significant result is found: a. Test a second
dependent variable that is correlated with the primary dependent variable (for which John et al. found a
65% admittance rate)
1 3 Even those who ignore p values of individual studies will find inflated ESs in the psychological literature GC1
if a sufficient number of researchers play strategically, which indeed many psychological researchers
appear to do (John et al., 2012).
2 4 Discussions have focused on aspects such as incentives in psychological research (e.g., Fanelli, 2010; D
John, Loewenstein, & Prelec, 2012), the review process (e.g., Wicherts, Kievit, Bakker, & Borsboom,
2012), replicability (e.g., Hartshorne & Schachner, 2012; Yong, 2012), publication bias (e.g., Fanelli,
2011; Francis, 2012; Renkewitz, Fuchs, & Fiedler, 2011), statistical methods and standards (e.g.,
Matthews, 2011; Wetzels et al., 2011), and scientific communication (e.g., Nosek & Bar-Anan, in press).
4 5 For those psychologists who expected that the embarrassments of 2011 would soon recede into memory, GC1
2012 offered instead a quick plunge from bad to worse, with new indications of outright fraud in the field
of social cognition (Simonsohn, 2012), an article in Psychological Science showing that many
psychologists admit to engaging in at least some of the QRPs examined by Simmons and colleagues
(John, Loewenstein, & Prelec, 2012), troubling new meta-analytic evidence suggesting that the QRPs
described by Simmons and colleagues may even be leaving telltale signs visible in the distribution of p
values in the psychological literature (Masicampo & Lalande, in press; Simonsohn, 2012), and an
acrimonious dust-up in science magazines and blogs centered around the problems some investigators
were having in replicating well-known results from the field of social cognition (Bower, 2012; Yong,
2012).
5 6 The increase in the rejection rate introduced by data peeking may not seem like much, but Simmons et al. BC
(2011) described several other tricks that can also inflate the rejection rate, and John, Loewenstein, and
Prelec (2012) reported evidence that experimental psychologists use some of these tricks
6 7 Two recent articles have highlighted the possibility that research practices spuriously inflate the presence BC
of positive results in the published literature (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, &
Simonsohn, 2011).
6 8 In summary, the demands for novelty and positive results create incentives for (a) generating new ideas BC
rather than pursuing additional evidence for or against ideas suggested previously; (b) reporting positive
results and ignoring negative results (Fanelli, 2012; Greenwald, 1975; Ioannidis & Trikalinos, 2007;
Rosenthal, 1979); and (c) pursuing design, reporting, and analysis strategies that increase the likelihood of
obtaining a positive result in order to achieve publishability (Fanelli, 2010a; Ioannidis, 2005; John et al.,
2012; Simmons et al., 2011; Wicherts, Bakker, & Molenaar, 2011; Wong, 1981; Young, Ioannidis, & Al-
Ubaydli, 2008).
6 9 Other contributions have detailed a variety of practices that can increase publishability but might BC
simultaneously decrease validity (Fanelli, 2010a; Giner-Sorolla, 2012; Greenwald, 1975; Ioannidis, 2005;
John et al., 2012; Kerr, 1998; Martinson, Anderson, & Devries, 2005; Rosenthal, 1979; Simmons et al.,
2011; Sovacool, 2008; Young et al., 2008).
6 10 The following are practices that are justifiable sometimes but can also increase the proportion of published PC
false results: (a) leveraging chance by running many low-powered studies, rather than a few high-powered
ones3 (Ioannidis, 2005); (b) uncritically dismissing “failed” studies as pilot tests or because of
methodological flaws but uncritically accepting “successful” studies as methodologically sound (Bastardi
et al., 2011; Lord, Ross, & Lepper, 1979); (c) selectively reporting studies with positive results and not
studies with negative results (Greenwald, 1975; John et al., 2012; Rosenthal, 1979) or selectively
reporting “clean” results (Begley & Ellis, 2012; Giner-Sorolla, 2012); (d) stopping data collection as soon
as a reliable effect is obtained (John et al., 2012; Simmons et al., 2011); (e) continuing data collection
until a reliable effect is obtained (John et al., 2012; Simmons et al., 2011); (f) including multiple
independent or dependent variables and reporting the subset that “worked” (Ioannidis, 2005; John et al.,
2012; Simmons et al., 2011); (g) maintaining flexibility in design and analytic models, including the
attempt of a variety of data exclusion or transformation methods, and reporting a subset (Gardner, Lidz, &
Hartwig, 2005; Ioannidis, 2005; Martinson et al., 2005; Simmons et al., 2011); (h) reporting a discovery
as if it had been the result of a confirmatory test (Bem, 2003; John et al., 2012; Kerr, 1998); and, (i) once a
reliable effect is obtained, not doing a direct replication (Collins, 1985; Schmidt, 2009; in an alternate
timeline, see Motyl & Nosek, 2012).
7 11 The impact of bias is exacerbated in an environment that puts a premium on output quantity: When BC
academic survival depends on how many papers one publishes, researchers are attracted to methods and
procedures that maximize the probability of publication (Bakker, van Dijk, & Wicherts, 2012; John,
Loewenstein, & Prelec, 2012; Neuroskeptic, 2012; Nosek, Spies, & Motyl, 2012).
7 12 In the face of human biases and the vested interest of the experimenter, such freedom of analysis provides BC
access to a Pandora’s box of tricks that can be used to achieve any desired result (e.g., John et al., 2012;
Simmons, Nelson, & Simonsohn, 2011; for what may happen to psychologists in the afterlife, see
Neuroskeptic, 2012).
7 13 The articles by Simmons et al. (2011) and John et al. (2012) suggest to us that considerable care needs to BC
be taken before researchers are allowed near their own data: They may well torture them until a
confession is obtained, even if the data are perfectly innocent.
8 14 Frank fabrication where all the data do not exist at all is probably uncommon, but other formes frustes of GC1
fraud may not be uncommon, and questionable research practices are probably very common (John,
Loewenstein, & Prelec, 2012).
9 15 A high reproducibility estimate might boost confidence in conventional research and peer-review BC
practices in the face of criticisms about inappropriate flexibility in design, analysis, and reporting that can
inflate the rate of false positives (Greenwald, 1975; John, Loewenstein, & Prelec, 2012; Simmons,
Nelson, & Simonsohn, 2011).
10 16 Recently, a growing number of methodological articles have painted a pessimistic picture of the state of GC2
the arts in behavioral science. From inappropriate significance testing (Bakker & Wicherts, 2011;
Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011) to questionable research practices (John,
Loewenstein, & Prelec, 2012; LeBel & Peters, 2011; Simmons, Nelson, & Simonsohn, 2011) and even
fraud, these articles have revived an interest in methodology.
10 17 Virtually all the critical arguments, and suggestions for improvement, that have been extracted from BC
recent articles on “voodoo correlations” (Fiedler, 2011; Vul, Harris, Winkielman, & Pasher, 2009),
inappropriate statistical tests (Nieuwenhuis, Forstmann, & Wagenmakers, 2011; Wagenmakers et al.,
2011), questionable research practices (John et al., 2012; Simmons et al., 2011), and replication (this
volume), are concerned with the problem of false positives or, in statistical jargon, α-error control.
10 18 Recently, several articles have gained publicity because they have linked the methodological issue of BC
replicability and quality of science to the serious ethical issue of data fabrication and fraud (John et al.,
2012). The critical assumption underlying this unfortunate linkage—which can cause great harm to the
image of behavioral sciences—is apparently that many false positives reflect researchers’ bad practices
and their deliberate strategies to deceive others and themselves.
11 19 Although studies with falsified data are rare, questionable research practices are all too common (John, GC1
Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011).
12 20 A more recent survey of psychologists found that over 90 % “admitted to having engaged in at least one SS
QRP” [5].
13 21 A survey suggests that psychologists openly admit multiple examinations of data before settling on the BC
most positive findings and to suppressing negative findings [66].
14 22 At the risk of creating a stereotype, I present here what might be a typical scenario. A team of excellent BC
scientists comes up with a smart new idea. They run a number of exploratory analyses based on one or a
few early, small samples and they publish a first polished, “clean” scientific report that highlights the best
results, typically without full documentation of the plethora of analyses that may have been performed.
Analytical choices have been employed selectively [2]
15 23 The past year has been a difficult one for the field of psychology, with several high-profile cases of fraud. GC1
It is appropriate to be outraged when falsehoods are presented as scientific evidence. On the other hand,
the large number of scientists who unintentionally introduce bias into their studies (John, Loewenstein, &
Prelec, 2012) probably causes more harm than the fraudsters.
16 24 Recent articles indicate that publication bias remains a problem in psychological journals (Fiedler, 2011; D
John, Loewenstein, & Prelec, 2012; Kerr, 1998; Simmons, Nelson, & Simonsohn, 2011; Strube, 2006;
Vul, Harris, Winkielman, & Pashler, 2009; Yarkoni, 2010).
16 25 Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping BC
dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of
obtaining a false-positive result. Moreover, a survey of researchers indicated that these practices are
common (John et al., 2012)
16 26 Everybody knows that researchers use a number of questionable research practices to increase their BC
chances of reporting significant results, and a high percentage of researchers admit to using these
practices, presumably because they do not consider them to be questionable (John et al., 2012)
16 27 There are positive signs in the literature on meta-analysis (Sutton & Higgins, 2008), the search for better BC
statistical methods (Wagenmakers, 2007), the call for more open access to data (Schooler, 2011), changes
in publication practices of journals (Dirnagl & Lauritzen, 2010), and increasing awareness of the damage
caused by questionable research practices (Francis, 2012a, 2012b; John et al., 2012; Kerr, 1998; Simmons
et al., 2011) to be hopeful that a paradigm shift may be underway
17 28 Moreover, the pressure to publish in peer-reviewed journals is so great that researchers may feel forced to BC
massage their results in such ways as to make them appear most significant and impactful, often at the
cost of the transparency with which they detail their hypotheses, methods, and analyses (John,
Loewenstijn & Prelec, 2012; Simons, Nelson & Simonsohn, 2011).
18 29 Fortunately, though, relatively few scientists appear to engage in outright fraud (John et al., 2012). O
19 30 This incertitude was quickly resolved by John et al. (2012). They surveyed over 2000 psychologists with SS
highly revealing results: Respondents affirmatively admitted to the practices of data peeking, data
monitoring, or conditional stopping in rates that varied between 20 and 60%
20 31 A recent survey of psychological researchers found that “selectively reporting studies that ‘work’” is the SS
“de facto scientific norm,” with prevalence estimates adjusted for response bias reaching 100% (John,
Loewenstein, & Prelec, 2012, p. 527).
21 32 Predictably, many researchers admit (anonymously) to selectively reporting experiments that produce SS
desirable outcomes (67% estimated prevalence), p value fishing (72% estimated prevalence), or failing to
report all dependent variables in an experiment (78% estimated prevalence) (John et al., 2012; and see
Neuroskeptic, 2012).
21 33 Yet when these students proceed to graduate level they encounter the unedifying reality that many BC
researchers – even unconsciously – will cherry pick analyses to reveal publishable effects or revise their
hypotheses to ‘predict’ unexpected findings (John et al., 2012).
22 34 Evidence suggests that some researchers formulate hypotheses after seeing the results of their studies. In a SS
recent survey of questionable research practices, about a third of the psychologists admitted that they had
reported unexpected findings as expected, although clinical psychologists had the lowest rates of the nine
subdisciplines included in the study (John, Loewenstein, & Prelec, 2012).
23 35 However, the mechanism’s effectiveness in actual application remains an open question, notwithstanding O
some initial encouraging results (Barrage and Lee 2010; John, Loewenstein, and Prelec 2012). The truth-
telling theorem requires assumptions that are not likely to be satisfied in any actual data set.
24 36 The use of questionable research practices—or researcher degrees of freedom—has also been highlighted BC
as contributing to false-positive results in psychology (John, Loewenstein, & Prelec, 2012; Simmons,
Nelson, & Simonsohn, 2011) and hence contributing to the nonreplicability of findings.
24 37 For measures, it is noteworthy that 15% (11 of 73) of design specification statements mentioned that BC
assessed measures went unreported because no statistically significant differences emerged on those
measures, which is clear evidence of researcher degrees of freedom (John et al., 2012; Simmons et al.,
2011).
24 38 With the additional disclosed information, readers of the article in question can more accurately interpret GC1
the reported results in light of the claim that many researchers have engaged in so-called questionable
research practices (John et al., 2012).
24 39 John et al. (2012) found that the majority of researchers—in a sample of over 2,000 psychologists— SS
admitted to not always reporting all of a study’s dependent measures or to deciding to collect more data
after looking to see whether the results were statistically significant.
25 40 John, Loewenstein, and Prelec (2012) asked 2,155 academic psychologists about the perceived prevalence SS
of 10 questionable research practices. About 40% of respondents admitted that they have occasionally
decided whether to exclude data after looking at the impact of doing so on the results. The authors argue
that this raw admission rates almost certainly underestimate the true prevalence. They offer an alternative
prevalence estimate that is derived from admission estimates, which indicates prevalence estimates that
are as high as 100% for 4 of the 10 research practices, including data exclusion after looking at the results.
25 41 Currently, researchers view selective reporting of studies that worked as a defensible practice (John et al., PC
2012).
26 42 However, so-called questionable research practices (e.g., omitting conditions or measures, concealing BC
negative results, sequential hypothesis testing) likely make considerable contributions to false positive
rates as well (John, Loewenstein, & Prelec, 2012).
26 43 Many factors contribute to elevated false positive rates, including small sample sizes (Button et al., 2013; D
Yarkoni, 2009), flexible analysis procedures (Carp, 2012a), and failures to publish negative results
(Ioannidis, 2005b; John et al., 2012). The risks posed by these practices have received increasing attention
in recent years. However, less has been said about perhaps the most basic requirement of reproducible
research: complete and clear description of experimental methods and results.
27 44 Reporting from a survey of 2155 psychologists Leslie John and colleagues (2012) similarly found that SS
although few of the respondents admitted to outright misconduct such as falsifying data (0.6%),
questionable practices such as “selectively reporting studies that ‘worked’” (45.8%) and “deciding
whether to exclude data after looking at the impact of doing so on the results” (38.2%), were prevalent
(John et al. 2012, p. 525).
27 45 Along the same lines Leslie John and colleagues (2012) argue that QRPs may lead to an arms race, where BC
it is necessary to venture deeper and deeper into the grey zone of QRP in order to survive as a
scientist:"QRPs are the steroids of scientific competition, artificially enhancing performance and
producing a kind of arms race in which researchers who strictly play by the rules are at a competitive
disadvantage. (John et al. 2012, p. 524)"
28 46 It is still possible that the statistically significant outcome represents a Type I error, even though the D
probability of a Type I error is typically set a comfortably low level (e.g., 5%). Further, depending on
individual research practices, the Type I error rate may be higher than the nominal alpha.
29 47 Another explanation for too much replication success is that authors used questionable research practices BC
or analysis methods (John et al., 2012, Simmons et al., 2011) in a way that increased the rejection rate of
their experiments
29 48 Unfortunately, some researchers seem to engage in sampling practices that lead to invalid hypothesis tests PC
(John et al., 2012).
29 49 They may have used inappropriate research practices (John et al., 2012, Simmons et al., 2011) that BC
inflated the rate of rejecting the null hypothesis.
29 50 However, it appears to be very common for researchers to engage in what are called “questionable GC1
research practices” (John et al., 2012) such as optional stopping, multiple testing, subject dropping, and
hypothesizing after the results are known (HARKing; (Kerr, 1998)).
30 51 Analytical flexibility is apparently also very common in the psychological sciences (Fanelli, 2010, John et GC1
al., 2012).
31 52 One recent study focused on the behavior of researchers with regard to questionable reporting practices. SS
John, Loewenstein, and Prelec (2012) anonymously surveyed more than 2,000 psychologists working at
research universities in the United States and found that 63% admitted to not reporting all dependent
measures that they assessed.
31 53 Although some outcome-reporting bias may be driven by space limitations in journals, both Chan and BC
Altman (2005) and John et al. (2012) both provide evidence that lack of statistical significance may be an
important reason why outcomes are omitted in published reports.
31 54 This result fits into an emerging research base raising concern about the degree of flexibility researchers BC
have in designing and analyzing studies, the lack of transparency of the reporting of many of these
choices, and the sometimes dramatic impact these choices can have on study results (e.g., Francis, in
press; Ioannidis, 2005; John et al., 2012; Simmons, Nelson, & Simonsohn, 2011).
32 55 Authors have an incentive not to comply with the ICMJE requirements. Prospective registration of trial GC1
protocol would obstruct significance questing practices such as: failing to report all of a study's dependent
measures, failing to report all of a study's conditions, and selectively reporting studies that “worked” (John
et al. 2012: 527). Significance questing is not only a problem for academic medicine, but a problem
plaguing any scientific discipline that prefers to publish statistically significant effects, including
psychology (Neuliep and Crandall 1990, 1993) and ecology (Palmer 2000; Jennions and Møller 2002a:
212). One meta-analysis suggests that 33.7% of scientific researchers admit to using at least one
questionable research practice (Fanelli 2009); a recent survey in psychology suggests that 94% have done
so (John et al. 2012: 527). These practices are now “the steroids of scientific competition, artificially
enhancing performance and producing a kind of arms race in which researchers who strictly play by the
rules are at a competitive disadvantage” (John et al. 2012: 524).
33 56 A number of reasons have been put forth for these concerns about replication, including conflicts of GC2
interest (Bakker & Wicherts, 2011; Ioannidis, 2011), misaligned incentives, questionable research
practices (John, Loewenstein, & Prelec, 2012) that result in what has been referred to as “p-hacking”
(Simmons et al., 2011), and ubiquitous low power (Button et al., 2013).
34 57 I can cite specific examples of the practices I criticize, but I cannot assess their frequency with statistical O
methods. John, Loewenstein, and Prelec (2012) offer an interesting estimation of the prevalence of
questionable research practices though.
35 58 John et al. (2012) find very high rates of questionable research practices among psychologists GC1
36 59 This emphasis on critical p values could, in turn, encourage problematic research practices, where BC
researchers engage a number of “researcher degrees of freedom” to achieve significant results (John,
Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011).
36 60 It has been speculated that this increased pressure to publish has impinged upon the integrity and D
objectivity of academic research (Fanelli, 2010; John et al., 2012; Song, Eastwood, Gilbody, Duley, &
Sutton, 2000)
36 61 All of these practices can be used to manipulate p values and potentially drive them towards significance BC
(John et al., 2012).
36 62 The ease with which data can be analysed may facilitate questionable research practices (John et al., 2012) BC
37 63 An anonymous survey of 2,000 psychologists estimated that the prevalence of data falsification was 9%, SS
although only 1.7% of respondents actually admitted having falsified data [12].
38 64 Other methodological problems, including expectancy effects and scientific misconduct, have been noted D
for decades or even centuries. However, scholars only recently started to assess systematically their
prevalence across fields and countries, study their causes, and openly discuss general solutions (e.g., 10–
13).
39 65 Selective reporting is typically regarded as a questionable research practice [13] GC2
40 66 Traditional studies investigating attitudes and behaviors via face-to-face interview may be biased by social O
desirability bias [12].
41 67 In a poll of more than 2000 psychologists, prevalences of ‘Deciding whether to collect more data after SS
looking to see whether the results were significant’ and ‘Stopping data collection earlier than planned
because one found the result that one had been looking for’ were subjectively estimated at 61% and 39%,
respectively (John, Loewenstein, & Prelec, 2012).
Table S2.2: Citing contexts: 2020
Pub ID   Cit ID   Citation Contexts   Claim type
381 1 When average estimates of others’ use are much higher than average self-report of the practice, it suggests O
that the practice is particularly socially undesirable and that self-report measures may underestimate
prevalence [17].
384 2 On the other hand, socially desirable responding might also affect QRP-reporting [27]. O
386 3 Yet there are additional forces and practices that can increase the rates of false positives. For example, GC1
there is a growing body of meta-scientific research showing the effects of excessive researcher degrees of
freedom (John et al. 2012; Simmons et al. 2011) or latitude in the way research is conducted, analyzed,
and reported.
386 4 The research practices that allow for this flexibility vary in terms of their severity and in the amount of O
consensus that exists on their permissibility (John et al. 2012). For example, researchers have sometimes
omitted failed experiments that do not support the focal hypothesis, and there are disagreements about the
severity and acceptability of this practice.
386 5 Indeed, a broader point is that there is debate over the extent of the problems that face psychology or other GC1
fields that have struggled with concerns about replicability such as the impact of publication bias (and
what to do about it; e.g., Cook et al. 1993; Ferguson & Brannick 2012; Ferguson & Heene 2012; Franco
et al. 2014; Kühberger et al. 2014; Rothstein & Bushman 2012) or the prevalence and severity of
questionable research practices (Fiedler & Schwarz 2015; John et al. 2012, Simmons et al. 2011).
390 6 Independently and nearly simultaneously, John et al. (2012) documented that a large fraction of GC1
psychological researchers admitted engaging in precisely the forms of p-hacking that we had considered;
for example, about 65% of respondents indicated that they had dropped a dependent variable when
reporting a study.
390 7 Along the same lines, Fiedler & Schwarz (2016) criticized John et al.’s (2012) survey assessing the CRIT
prevalence of questionable research practices on the grounds that the survey did not sufficiently
distinguish between selective reporting that was well intentioned and selective reporting that was ill
intentioned.
393 8 Some of the problem is due to researchers’ questionable research practices (Bakker, Van Dijk, & GC2
Wicherts, 2012; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011;
Woodside, 2016),
394 9 Fourth, researchers engage in questionable practices that degrade research quality. These can overlap with SS
researcher degrees of freedom and with limitations in data and method, but here our emphasis is on
purposeful acts. John et al. (2012) surveyed more than 2,000 research psychologists, asking if they had
ever engaged in what the authors deemed QRPs like falsifying data or results, selectively reporting only
significant results, claiming to have predicted unexpected findings, arbitrarily excluding outliers, failing
to report all dependent variables, and deciding if more data needed to be collected after looking at initial
results. Although less than 1% of respondents admitted falsifying data or results, nearly half admitted
selectively reporting only studies that worked. More than half admitted not reporting all dependent
variables and peeking at results to determine whether to collect more data
395 10 Our knowledge about scientific misconduct is increasing. We know that there are non-negligible rates of GC1
serious misconduct, e.g. fabrication, falsification and plagiarism in many areas of academic research; and
that other types of misconduct, e.g. authorship misconduct and problematic data manipulation are even
more prevalent (Ana et al. 2013; Anderson, Martinson, and de Vries 2007; Bakker and Wicherts 2011;
Bozeman and Youtie 2016; Davis, Riske-Morris, and Diaz 2007; de Vries, Anderson, and
Martinson 2006; Fanelli 2009; Fang, Bennett, and Casadevall 2013; George 2016; Hofmann et al. 2015;
John, Loewenstein, and Prelec 2012; Komic, Marusic, and Marusic 2015; Lafollette 2000; Martinson et
al. 2006; Marusic, Bosnjak, and Jeroncic 2011; Okonta and Rossouw 2014; Pryor, Habermann, and
Broome 2007; Pupovac and Fanelli 2015; Ranstam et al. 2000; Redman, Yarandi, and Merz 2008; Sarwar
and Nicolaou 2012; Saurin 2016; Stern et al. 2014; Tijdink et al. 2016; Tijdink, Verbeke, and
Smulders 2014)
396 11 Concerns about false discoveries due to p-hacking, or data snooping, are not limited to finance, but BC
arguably affect all observational or experimental studies (Ioannidis (2005), John et al. (2012), Simonsohn
et al. 2014).
397 12 Possible reasons are that: (1) the variable was outside the scope of the specific study, (2) at that time, a PC
variable was not operationalized yet (for example, “emotional intelligence” or “cyberbullying” are fairly
new concepts), (3) the variable did not show interesting effects, and therefore the authors decided not to
report anything on this variable (selective reporting; John, Loewenstein, & Prelec, 2012).
398 13 The awareness that many analyses are underpowered (Rossi, 1990; Sedlmeier & Gigerenzer, 1989; Szucs GC1
& Ioannidis, 2017), based on questionable statistical practices (John, Loewenstein, & Prelec, 2012;
Simmons, Nelson, & Simonsohn, 2011), seem implausible in aggregate (Francis, 2012; Ioannidis, 2005)
or do not replicate (Open Science Collaboration, 2015) has generated a host of beneficial
recommendations.
400 14 The growing number of reports of research misconduct (Fanelli, 2009) and questionable research GC1
practices (L. K. John, Loewenstein, & Prelec, 2012) suggest that graduate students may be exposed to
both ethical and unethical choices and decision making by their mentors.
402 15 John et al. (2012) assume the vast majority of academics are sincerely motivated to conduct sound GC1
research. However, is their study of over 2,000 psychologists, they found a large grey exists regarding
acceptable versus QRPs. For example, they noted that falsifying data is never justified but perhaps not
failing to report all of a study’s dependent measures.
402 16 Fiedler and Schwarz (2016) noted that John et al.’s study received much media attention but considered CRIT
their study overestimated QRP prevalence. They decomposed QRP prevalence into its two related
components, proportion of scientists who ever committed such behaviour and if so, how frequently they
repeated this behaviour across all their research. Their resulting prevalence estimates were lower by order
of magnitudes and also focussed on the quality of survey instruments to collect data which might
influence analysis and results.
405 17 A survey of over 2000 psychology researchers indicates that HARKing is disturbingly prevalent [22]. PC
405 18 The shape and thrust of entire disciplines can be influenced by the undesirable implications of publication D
bias. These issues are not new. They have been comprehensively reported across many disciplines,
including medicine [21], psychology [22], political science [20], biology [13], and general science [14].
405 19 While many scientists might agree that other scientists are susceptible to inappropriate experimental SS
behaviour, evidence suggests that it is troublingly widespread. In a survey of over 2000 psychology
researchers, John et al. [22] examined the prevalence of questionable experimental practices. Their survey
questions serve as a useful classification of ten different forms of HARKing, as follows.
405 20 The percentage values show the respondent’s self-admission rates for the following questionable practices SS
[22, Table 1]: 1. failing to report all dependent measures, which opens the door for selective reporting of
favourable findings – 63.4%; 2. deciding to collect additional data after checking if the effects were
significant – 55.9%; 3. failing to report all of the study’s conditions – 27.7%; 4. stopping data collection
early once the significant effect is found – 15.6%; 5. rounding off a p value (e.g., reporting p = .05 when
the actual value is p = .054) – 22.0%; 6. selectively reporting studies that worked – 45.8%; 7. excluding
data after looking at the impact of doing so – 38.2%; 8. reporting an unexpected finding as having been
predicted – 27.0%; 9. reporting a lack of effect of demographic variables (e.g., gender) without checking
– 3.0%; 10. falsifying data – 0.6%.
406 21 With regard to other questionable research practices (QRPs) such as “p hacking,” although one study CRIT
showed that they were rampant in psychology (John, Loewenstein, & Prelec, 2012), another found that
those results were probably inflated by the way the questions were phrased (Fiedler & Schwarz, 2016).
407 22 QRPs comprise practices that unambiguously qualify as scientific misconduct (e.g., falsifying data) and GC2
others that are less clear (e.g., failing to report all of a study’s dependent measures; John, Loewenstein, &
Prelec, 2012; Motyl et al., 2017; Stürmer, Oeberst, Trötschel, & Decker, 2017).
407 23 There was criticism regarding the prevalence definition applied in some of the survey studies (e.g., John CRIT
et al., 2012) because the percentage of researchers who admitted to have engaged in a QRP at least once
was equated with the prevalence of the respective QRP (Fiedler & Schwarz, 2016).
407 24 This reflects that scientific misconduct is not always detected. Yet, scientific misconduct was prevalent GC1
across a variety of geographic regions (Agnoli et al., 2017; Braun & Roussos, 2012; John et al., 2012) and
in almost all psychological subfields.
409 25 The issue with dark numbers in estimating misconduct rates have lead scientists, in analogy with O
criminologists (Van Buggenhout and Christiaens 2016), to adopt various other ways of collecting data on
misconduct and errors in research, including (self-reported) misconduct surveys (Martinson et al. 2005),
sometimes using incentives for truth-telling (John et al. 2012); or digital tools for detecting problematic
research (Horbach and Halffman 2017a), in addition to retraction rates.
410 26 Whereas behaviors like data fabrication are clearly unethical, QRPs exploit the ethical shades of gray that GC1
color acceptable research practice and “offer considerable latitude for rationalization and self-
deception.”4 Consequently, QRPs are more prevalent and, many have argued, more damaging to science
and its public reputation than obvious fraud.4–8
412 27 The evidence suggesting engagement in questionable research practices has been found in social and GC1
natural sciences (Fanelli, 2009, 2010, 2011) including biomedical sciences (Ioannidis, 2005),
neuroscience (Vul, Harris, Winkielman, & Pashler, 2009), economics (Brodeur, Lé, Sangnier, &
Zylberberg, 2016), and psychology (John, Loewenstein, & Prelec, 2012).
416 28 There has been growing recognition of the BTS as a potential incentive mechanism for accurate reporting O
across a range of survey types (e.g. John et al., 2012; Weaver and Prelec, 2013).
417 29 This degree of transparency is geared explicitly toward reducing Questionable Research Practices (QRPs; GC2
Fiedler & Schwarz, 2016; John, Loewenstein, & Prelec, 2012), such as selective reporting of measures,
conditions, and/or data, coupled with haphazard sample size determination processes
424 30 Such blatant misconduct can be addressed, for example, through legal mechanisms, whereas the “less GC2
flagrant, more subtle cases of potential misconduct”, or what Fanelli (2009) and John et al. (2012) call
‘questionable research practices’, remain poorly understood.