Scientists can’t even prove their data are real

I gave a workshop on preregistration to honours students last Friday and mentioned that preregistration provides evidence that you created your hypothesis in advance of seeing the data. The students naturally asked how to prove that the data were collected after the preregistration. I pointed out that one can’t even prove that the data are real. At least, nobody has put together an accepted method for this.

I have advocated for research funders and universities to create infrastructure for data chain-of-custody records to solve this, much as major camera companies are building trusted timestamping and the like into their camera hardware to certify that the resulting images aren’t deepfakes.

I haven’t heard of any movement on this in my own field of psychology, where the data come from human participants. Some manufacturers of Internet-connected biological instruments, I believe, have their machines automatically create records and send them to a manufacturer-managed database.

The lack of action on this is galling, even by those with plenty of resources. Harvard, hello – don’t you want to protect your reputation from the hit delivered by the several fraudsters that, as more and more of us now know, you hired and promoted?

So now the flood has begun: AI-written, AI-fabricated-data papers contaminating science, making fraud much easier and so surely greatly increasing its rate. At least the journal publishers have a financial interest in doing something about this, but I don’t see them getting their act together to create real infrastructure for this.

So I predict that in the short- to medium-term, it will be a case of rich-get-richer: major Western universities that people and publishers trust will get a pass, while researchers seen as more questionable (e.g., those at medical schools in China), sometimes justifiably given high apparent past rates of fraud, will be increasingly discriminated against. Merton’s norm of universalism (science is judged by the work itself, not by the reputation of the scientist) will rapidly be undermined.

Scientists! What are you supporting?

Much has been said about how expensive academic journals are. Large companies like Elsevier, Sage, Springer Nature, Taylor & Francis, and Wiley publish most of the major journals, and their shareholders pocket much of the “rent” they receive thanks to academics’ labor.


CC BY-SA 4.0 Fluffybuns.

There are alternatives. One of them is based on Wikipedia, whose process for vetting information is more transparent than that of most journals. The back-and-forth between authors and other Wikipedia volunteers that results in changes to Wikipedia is right there in the talk pages, available as it is happening, and anyone can chime in. Contrast this with academic journals, which are largely a closed shop.

To be fair, while the “shops” may be closed, they do have more windows than they used to. Many journals have come out from behind the paywalls, and now practice more accountability, such as by indicating which editor handled each article, and by having a policy on editors publishing in their own journal. To their credit, the Association for Psychological Science journals, for example, have long had a policy that when an editor or associate editor submits to their own journal, the review process for their article is managed by an external guest editor to avoid conflicts of interest. When I was an associate editor several years ago at one APS journal (AMPPS), this is what we did. I recently realized that not all APS editors are aware of their own policy, however, and that sort of forgetting is another example of why keeping the windows open, so that we can see what is happening inside, is important.

Photo: Public domain.

As part of the open windows principle, we should also expect journals to produce evidence that they effectively evaluate submissions for whether they are scientifically sound. Now, if asked how we can be confident that they are publishing quality scholarship, most journal editors would point to peer review. When asked to produce examples, however, they’d have to say something like “Can’t do that! Peer review reports are confidential.”

This “you’ll just have to trust us” type of situation is ironic for a class of people who long have held skepticism to be critical to what they do. And for me as an acculturated academic, I confess it almost feels like a betrayal to state this as plainly as I have. I imagine colleagues trying to push aside the point, with responses like “Alex, you know we try hard to get good peer reviewers, besides, in the end, science yields things that work, so your point is misleading.”

Photo: Public domain

I actually agree that science works on average, but often readers need to know whether there is much reason to have confidence in particular papers. Fortunately not all editors are so defensive that they cannot acknowledge this point. It took time, but by about a decade or so ago, a bunch of journal editors had freed the peer review reports from the confines of their password-protected journal management systems, allowing anyone to read them. Finally, readers had direct evidence of how well a journal is actually vetting its articles. Just as importantly, readers no longer had to rely on the overall journal reputation to make a guess about the process undergone by an individual article – they could actually see the peer review reports for an article they were interested in.

While the processes happening inside journals had to be dragged into the open, Wikipedia and its associated projects have always had openness baked in.

One project associated with Wikipedia is the WikiJournal of Science. This is a proper scholarly journal, one indexed by mainstream publication databases such as the Scopus database maintained by Elsevier. But unlike a conventional journal, most of the peer review process at the WikiJournal of Science happens in the open from the beginning. It’s all in the “Discuss” page that sits alongside each article.

In another convergence with mainstream journals, four years ago the prestigious eLife journal announced that they would only review manuscripts that had already been published elsewhere as a preprint, as part of their “long-term plan to create a system of curation around preprints that replaces journal titles as the primary indicator of a paper’s perceived quality and impact”. This has always been the preferred route for the WikiJournal of Science – manuscripts ideally are submitted by linking to a publicly-available preprint.

I’ve been an associate editor for the WikiJournal of Science for a year or so. One manuscript I handled reported a study suggesting that geckos spontaneously “play” by running in running wheels. As the editor, I was pleased to have the opportunity to usher in new knowledge about these gravity-defying reptiles.

The Australian house gecko. CC-BY me.

One of my first jobs was to email several experts on animal play to ask them to review the manuscript, which the author had posted as a preprint on WikiJournal Preprints. Two agreed, and after receiving the peer reviews, I posted them on the preprint’s Discuss page where, if anyone else were moved to do so, they could also comment. The author responded to the reviewers’ comments, and those responses also can be seen on the Discuss page. Much of the reviewing process, then, works like a conventional journal, just more transparently and able to appear in real time.

When I edited that manuscript, I had no scientific knowledge of animal play (moreover, I had consistently resisted our dog’s offers to give me real-world experience).

Hugo. More attractive dogs abound, but yeah, you can use this photo if you want.

It would have been nice if we had had a more knowledgeable editor for the gecko manuscript, but we’re currently spread pretty thin in the editorial department. That’s one reason for this post (apply to be an editor! You don’t need to know anything about geckos!).

As the 🦎 example illustrates, like a conventional academic journal, we publish original research at the WikiJournal of Science. But the most common use of the journal is for academic peer review of articles that are intended for Wikipedia itself, and these typically don’t include original research. Before I joined the journal as an editor, for example, I saw that the Wikipedia article for “multiple object tracking” was a bit spotty in its coverage. Unsurprising, of course, as it’s quite an obscure topic. But because I had just written a short book on object tracking, I considered myself well-placed to write a more comprehensive Wikipedia entry. The eventual article I wrote was based on my book, together with others’ publications, so it didn’t count as the type of original research that is prohibited by Wikipedia.

I submitted my draft Wikipedia article to the WikiJournal, and it eventually passed peer review. As a result, the editor replaced the existing Wikipedia entry with my article. This was quite satisfying – given how widely Wikipedia is used, my contributions to this obscure topic are probably now much more influential than if they had remained confined to academic journals and my book.

A nice aspect of the WikiJournal of Science is that part of the revision process occurs almost instantaneously, thanks to its wiki infrastructure. As I read through a submission, I typically make small edits on the preprint itself to improve the language, just as many Wikipedians do when they come across a Wikipedia entry they are interested in. The reviewers of the manuscript are able to do the same thing. The author is not obliged to keep these edits, of course; they can revert them and explain why in their response letter.

This really should be seen as basic functionality, as it is similar to the nearly universally-used Track Changes in Word or Google Docs. But despite most of us collaborating on documents in that way for decades now, most academic journals still don’t have this functionality.

Reviewers and editors at traditional journals typically aren’t able to enter the journal’s system and directly make suggestions on the manuscript. Instead, they write their comments in a separate, standalone document or form. This lack of functionality for scholarly communication is one illustration of how little the scientific community has gotten for the billions of dollars that it has been paying to publishers each year (the previous link is for APCs alone; it doesn’t even include subscription payments, or the free peer reviewing that academics do).

The unwieldiness of journals’ systems is not because corporations generally don’t deliver good products or continually improve their service; many do. But academic journals are not part of a functioning economic market. In the dreamworld of a functioning market for scholarly communication, the journal that provides the best service and features would win the most market share. In the world we actually live in, the owners of the journals (who are sometimes the publisher, and sometimes a scientific society) simply wait for submissions from the researchers. They know that researchers will stick with the journals that have the highest impact factors in their field, which then results in those journals maintaining a high impact factor, with little effect of the fees charged or the quality of the services provided.

I think that all of this means that you should support diamond open access journals in general, not just the WikiJournal of Science.

Rob Lavinsky, iRocks.com. CC-BY-SA-3.0.

Diamond open access journals are those that are free to read and to publish in. They typically use open source software (the wiki infrastructure in the case of the WikiJournal of Science, and Open Journal Systems for thousands of other diamond OA journals) hosted by a nonprofit institution, such as a university. The open source software does tend to be clunkier than the big publishers’ systems, which does mean it’s more annoying for the academic editors involved. But the alternative, the tradeoff of letting corporate publishers handle things in return for billions of dollars and a corruption of academic values, is an even worse deal.

But why should you, an individual scholar, have to do something about this? The primary way that scientists in a field come together to get things done (aside from doing science, reviewing, and editing itself) is through scholarly societies. Scientific societies were designed to serve scientists’ interests. They should be leading the way to reducing dependence on corporate publishers and creating diamond OA journals.

But many scientific societies have been captured by their publishers. Here’s how it happened. As part of a contract giving a publisher the right to publish the society’s journal, the publisher provides the society with a payment. Over the years, this payment rose, reflecting the steady increase in subscription and/or APC fees. While the payment is only a small fraction of how much the journal makes (otherwise the publisher wouldn’t have the high profits that they do), it’s a substantial amount of money for a scientific society, and quite a high percentage of its revenue in the current era of declining income from in-person conferences. Societies pay much of their staff salaries out of this money, and many have hired more staff with it. For many societies, these staff end up making most of the society’s decisions, or advising the academics who ostensibly make the decisions but offer little resistance. As the staff’s jobs depend on maintaining the society’s revenue, giving up the publishing income is a non-starter. This dynamic has played out even at some of the most respected and active scientific societies, as we recently learned in the case of the American Association for the Advancement of Science.

Within psychology, the Association for Psychological Science (APS) is another example. Six months ago, APS suddenly announced they were starting a new journal, with no evidence of consultation with academics. Indeed, the announcement was strangely light on details of why they were starting a journal and what the vision for it was. So I wasn’t the only one who suspected it was concocted simply to create a new revenue stream.

Yesterday, I did some digging. The publisher used by APS, Sage, maintains a spreadsheet listing the publication fees (APCs) for the open access journals they publish. Advances in Psychological Science Open is now in that list, just below Advances in Methods and Practices in Psychological Science, formerly APS’ only fully open access journal. The price to publish in the new journal? Two grand and five bills!

That APC (Article Processing Charge) of $2500 is $1500 more than that for APS’ better-established journal (AMPPS).

In short, APS is starting an expensive journal that has little to no buy-in from the community (judging from social media) and hoping that demand for the prestige of the APS brand, combined with the reject-and-refer system developed by PLOS, and perfected by Nature Publishing, will bring the money rolling in.

I’m here all week, folks, re-inventing old jokes.

If you’re a tenured academic, you shouldn’t be editing for journals like that!

I’d better re-phrase that. Because admittedly, I myself took an editorial stipend from APS, first at Perspectives on Psychological Science over ten years ago when some of us started the Registered Replication Report format there, and subsequently when we co-founded the journal Advances in Methods and Practices in Psychological Science.

Here’s my rewrite: if you are a tenured academic, you should be devoting a bunch of your time to cultivating alternatives to the usual money-sucking journal racket.

Over at freejournals.org, we highlight quality diamond OA journals, and we diamond OA editors try to support each other. So here I am, trying to promote this. While not many people read this blog, a lot of people are occasionally forced to read emails from me (simply because I am a more-or-less tenured academic). Therefore, I have changed my email signature: it now advertises the diamond OA initiatives that I am most involved in.

My email signature, some of the time.

And now it is time for me to turn to other activities for avoiding the news.

Bridgeman Art Library. Supplied by The Public Catalogue Foundation.

Postscript. Perhaps the biggest challenge facing the WikiJournal of Science is our high liability insurance bill (for things like defamation suits); my colleagues have contacted dozens of insurers but none would give us a lower bill. And that was before Elon Musk started threatening Wikipedia! If you think you can help us, please get in touch.

Let your blog run free!

Don’t allow your writing to be tied to one platform – register your science-related blog with Rogue Scholar, the free blog indexing service helping bring science blogs into scholarly database infrastructure.

I checked with Martin Fenner, who created and runs Rogue Scholar, and he said it works fine with non-paywalled Substack blog posts, I think because the full text of free posts is provided by Substack in the RSS feed. I believe Rogue Scholar needs the full text partly to extract some of the metadata needed to populate scholarly databases.
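If you’re curious whether your own blog’s feed exposes full text, here is a minimal sketch of how you might check, assuming the Python feedparser library; the feed URL below is hypothetical, and this is my rough check rather than anything Rogue Scholar itself uses.

```python
# Rough check of whether a blog's RSS feed exposes the full text of posts,
# which is what an indexer like Rogue Scholar needs to extract metadata.
import feedparser

feed = feedparser.parse("https://example.substack.com/feed")  # hypothetical feed URL

for entry in feed.entries[:5]:
    content = entry.get("content")            # full text, if the feed provides it
    body = content[0]["value"] if content else entry.get("summary", "")
    print(f"{entry.get('title', '(untitled)')}: {len(body)} characters of text in the feed")
```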

I should say, however, that I found the registration page difficult to navigate and needed Martin’s help to register my blog. It turns out that this is because the site was recently re-worked and some parts still rely on the legacy codebase, making some of the site’s internal links confusing. Growing pains are to be expected, however, especially for a free project, one that I believe is very much worth your support!

Check out what Rogue Scholar has accomplished over the last year: https://blog.front-matter.io/posts/report-rogue-scholar-advisory-board-meeting-october-16-2024/

Committing research fraud is easy. Let’s make it harder.

originally published by The Chronicle of Higher Education as “How to Stop Academic Fraudsters” (I didn’t choose that title)

“Hi Alex, this is not credible.”

I’ll never forget that email. It was 2016, and I had been helping psychology researchers design studies that, I hoped, would replicate important and previously published findings. As part of a replication-study initiative that I and the other editors had set up at the journal Perspectives on Psychological Science, dozens of labs around the world would collect new data to provide a much larger dataset than that of the original studies.

With the replication crisis in full swing, we knew that data dredging and other inappropriate research practices meant that some of the original studies were unlikely to replicate. But we also thought our wide-scale replication effort would confirm some important findings. Upon receiving the “this is not credible” message, however, I began to be haunted by another possibility — that at least one of those landmark studies was a fraud.

The study in question was reminiscent of many published in high-impact journals in the mid-2010s. It indicated that people’s mood or behavior could be shifted a surprising amount by a subtle manipulation. The study had found that people became happier when they described a previous positive experience in a verb tense suggesting an ongoing experience — rather than one set firmly in the past. Unfortunately for psychology’s reputation, social-priming studies like that had been falling like a house of cards, and our replication failed, too. In response, the researchers behind the original study submitted a new experiment that appeared to shore up their original findings. With their commentary, the researchers provided the raw data for the new study, which was unusual at the time, but it was our policy to require it. This was critical to what happened next.

One scholar involved in the replication attempt had a close look at the Excel spreadsheet containing the new data. The spreadsheet had nearly 200 rows, one for each person who had supposedly participated in the experiment. But the responses of around 70 of them appeared to be exact duplicates of other people in the dataset. When the duplicates were removed, the main result was no longer statistically significant.
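To give a concrete sense of this kind of check, here is a minimal sketch, not the actual analysis from that case: it assumes a hypothetical spreadsheet and hypothetical column names (participant_id, condition, happiness), flags rows that exactly duplicate an earlier row, and re-runs a headline comparison with and without them.

```python
# Sketch: detect exact duplicate rows in a submitted spreadsheet and see whether
# the headline test survives their removal. File and column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_excel("submitted_data.xlsx")

# Flag rows whose response columns exactly duplicate an earlier row
response_cols = [c for c in df.columns if c != "participant_id"]
dupes = df.duplicated(subset=response_cols, keep="first")
print(f"{dupes.sum()} of {len(df)} rows duplicate an earlier row")

# Re-run the headline comparison with and without the duplicated rows
def main_test(d):
    a = d.loc[d["condition"] == "ongoing", "happiness"]
    b = d.loc[d["condition"] == "completed", "happiness"]
    return stats.ttest_ind(a, b, equal_var=False)

print("All rows:          ", main_test(df))
print("Duplicates removed:", main_test(df[~dupes]))
```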

After thanking the scholar who had caught the problem, I pointed out the data duplication to the researchers behind the original study. They apologized for what they described as an innocent data-processing mistake. Then, rather conveniently, they discovered some additional data they said they had accidentally omitted. With that data added in, the result was statistically significant again. By this point, the scholar who had caught the duplication had had enough. The new data, and possibly the old, were no longer credible.

I conducted my own investigation of the Excel data. I confirmed the irregularities and found even more inconsistencies when I examined the raw data exactly as downloaded from the online service used to run the study. The other journal editors and I still didn’t believe that the reason for the irregularities was fraud — all along, the researchers behind the original study had seemed very nice and were very obliging about our data requests — but we decided that we shouldn’t publish the commentary that accompanied the questionable new data. We also reported them to their university’s research-integrity office. After an investigation, the university found that the data associated with the original study had been altered in strategic ways by a graduate student who had also produced the data for the new study. The case was closed, and the paper was retracted, but the cost had been substantial, involving thousands of hours of work by dozens of people involved in the replication, the university investigators, and at least one harried journal editor (me).

More recently, two high-profile psychology researchers, Francesca Gino of Harvard and Dan Ariely of Duke, faced questions about their published findings. The data in Excel files they have provided show patterns that seem unlikely to have occurred without inappropriate manipulation of the numbers. Indeed, one of Ariely’s Excel files contains signs of the sort of data duplication that occurred with the project I handled back in 2016.

Ariely and Gino both maintain that they never engaged in any research misconduct. They have suggested that unidentified others among their collaborators are at fault. Well, wouldn’t it be nice, for them and for all of us, if they could prove their innocence? For now, a cloud of suspicion hangs over both them and their co-authors. As the news has spread and the questions have remained unresolved, the cloud has grown to encompass other papers that Ariely and Gino were involved in, for which clear data records have not yet been produced. Perhaps as much to defend their own reputations as to clean up the scientific record, Gino’s collaborators have launched a project to forensically examine more than 100 of the papers that she has co-authored. This vast reallocation of academic expertise and university resources could, in a better system, be avoided.

How? Researchers need a record-keeping system that indicates who did what and when. I have been using Git to do this for more than a decade. The standard tool of professional software developers, Git allows me to manage my psychology-experiment code, analysis code, and data, and provides a complete digital paper trail. When I run an experiment, the data are recorded with information about the date, time, and host computer. The lines of code I write in R to do my analysis are also logged. An associated website, GitHub, stores all of those records and allows anyone to see them. If someone else in my lab contributes data or analysis, they and their contributions are also logged. Sometimes I even write up the resulting paper through this system, embedding analysis code within it, with every data point and statistic in the final manuscript traceable back to its origin.
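As a rough illustration of the paper trail, not my exact setup, here is how a freshly written data file might be committed from Python so that Git logs who added it, from which machine, and when; the file path and repository are hypothetical.

```python
# Sketch: commit a just-collected data file so Git records who added it and when.
import datetime
import socket
import subprocess

data_file = "data/session_2024-05-01_P07.csv"   # hypothetical output of one experiment session
message = (f"Add raw data: {data_file} "
           f"(host: {socket.gethostname()}, "
           f"time: {datetime.datetime.now().isoformat(timespec='seconds')})")

subprocess.run(["git", "add", data_file], check=True)
subprocess.run(["git", "commit", "-m", message], check=True)

# Anyone with access to the repository can later audit the file's history, e.g.:
#   git log --follow -- data/session_2024-05-01_P07.csv
```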

My system is not 100 percent secure, but it does make research misconduct much more difficult. Deleting inconvenient data points would be detectable. Moreover, if trusted timestamping is used, the log of file changes is practically unimpeachable. Git is not easy to learn, but the basic concept of “version history” is today part of Microsoft Word, Google Docs, and other popular software and systems. Colleges and universities should ensure that whatever software their researchers use keeps good records of what the researchers do with their files.
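The core of trusted timestamping is simple: a cryptographic digest of the file is what gets signed along with the current time, so any later change to the file is detectable. A minimal sketch of the hashing step, using the same hypothetical file name as above (the call to an external timestamping authority is not shown):

```python
# Sketch: fingerprint a data file's exact contents; this digest is what a
# timestamping service would sign, alongside the current time.
import hashlib
from pathlib import Path

digest = hashlib.sha256(Path("data/session_2024-05-01_P07.csv").read_bytes()).hexdigest()
print(digest)   # record this, or have a timestamping authority sign it
```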

While enabling more recording of version history would be only a small step, it could go a long way. The Excel files that Gino and Ariely have provided have little to no embedded records indicating what changes were made and when. That’s not surprising — their Excel files were created years ago, before Excel could record a version history. Even today, however, with its default setting, Excel deletes from its record any changes older than 30 days. Higher-ed institutions should set their enterprise Excel installations to never delete their version histories. This should also be done for other software that researchers commonly use.

Forensic data sleuthing has found that a worrying number of papers published today contain major errors, if not outright fraud. When the anesthesiologist John Carlisle scrutinized work submitted to the journal he edited, Anaesthesia, he found that of 526 submitted trials, 73 (14 percent) had what seemed to be false data, and 43 (8 percent) were so flawed they would probably be retracted if their data flaws became public (he termed these “zombie” trials). Carlisle’s findings suggest that the literature in some fields is rapidly becoming littered with erroneous and even falsified results. Fortunately, the same record-keeping that allows one to conduct an audit in cases of fraud can also help colleges, universities, journals, and researchers prevent errors in the first place.

Errors will always occur, but they are less likely to cause long-lasting damage if someone can check for them, whether that’s a conscientious member of the research team, a reviewer, or another researcher interested in the published paper. To better check the chain of calculations associated with a scientific claim, more researchers should be writing their articles in a system that can embed code, so that the calculations behind each statistic and point on a plot can be checked. These are sometimes called “executable articles” because pressing a button executes code that can use the original data to regenerate the statistics and figures.
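To make the idea concrete, here is a toy sketch, with hypothetical file and column names, of the basic move behind executable articles: each reported statistic is produced by a function from the raw data rather than typed by hand, so the document can regenerate it on demand.

```python
# Sketch of the idea behind an "executable" article: reported numbers are
# generated from the raw data by code, not typed into the manuscript by hand.
import pandas as pd

def mean_happiness(csv_path: str) -> str:
    """Return the statistic formatted exactly as it should appear in the text."""
    scores = pd.read_csv(csv_path)["happiness"]
    return f"M = {scores.mean():.2f}, SD = {scores.std():.2f}, N = {len(scores)}"

# A document-building tool (a notebook, or an R Markdown-style pipeline) would
# insert the returned string into the sentence reporting the result, so that
# re-running the document regenerates every statistic from the raw data.
print(mean_happiness("data/experiment1.csv"))
```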

Scholars don’t need to develop such systems from scratch. A number of services have sprung up to help those of us who are not seasoned programmers. A cloud service called Code Ocean facilitates the creation of executable papers, preserving the software environment originally used so that the code still executes years later. Websites called Overleaf and Authorea help researchers create such documents collaboratively rather than leaving it all on one researcher’s computer. The biology journal eLife has used a technology called Stencila to permit researchers to write executable papers with live code, allowing a paper’s readers to adjust the parameters of an analysis or simulation and see how that changes its results.

Universities and colleges, in contrast, have generally done very little to address fraud and errors. When I was a Ph.D. student in psychology at Harvard, there were two professors on the faculty who were later accused of fraud. One of them owned up to the fraud and helped get her work retracted. The other, Marc Hauser, “lawyered up” and fought the accusations, but nevertheless he was found by Harvard to have committed scientific misconduct (the U.S. Office of Research Integrity also found him to have fabricated data).

As a result, Harvard had more than a decade after the findings of serious fraud by two of its faculty members to prepare for, and try to prevent, future misconduct. When news of the Gino scandal broke, I was shocked to learn how little Harvard seemed to have improved its policies. Indeed, Harvard scrambled to rewrite its misconduct policies in the wake of the new allegations, opening up the university to accusations of unfair process, and to Gino’s $25-million lawsuit.

The problems go well beyond Harvard or Duke or even the field of psychology. Not long after John Carlisle reported his alarming findings from clinical-trial datasets in anesthesiology, a longtime editor of the prestigious BMJ (formerly the British Medical Journal) suggested that it was time to assume health research is fraudulent until proven otherwise. Today, a number of signs suggest that the problems have only worsened.

Marc Tessier-Lavigne is a prominent neuroscientist and was, until recently, president of Stanford University. He had to resign after evidence emerged of “apparent manipulation of research data by others” in several papers that came from his lab — but not until after many months of dogged reporting by the Stanford student newspaper. Elsewhere in the Golden State, the University of Southern California is investigating the star neuroscientist Berislav Zlokovic over accusations of doctored data in dozens of papers, some of which led to drug trials in progress.

In biology labs like those of Tessier-Lavigne and Zlokovic, the data associated with a scientific paper often include not only numbers but also images from gel electrophoresis or microscopy. An end-to-end chain of certified data provenance there presents a greater challenge than in psychology, where everything involved in an experiment may be in the domain of software. To chronicle a study, laboratory machines and microscopes need to record data in a usable, timestamped format, and must be linked into an easy-to-follow laboratory notebook.

If we want science to be something that society can still trust, we must embrace good data management. The $25 million that Harvard could lose to Gino — while a mere drop in the operating budget — would go far if spent on developing good data-management systems and training researchers in their use. The reputational returns to Harvard, to its scholars, and to academic science in general would repay the investment many times over. It’s time to stop pretending academic fraud isn’t a problem, and to do something about it.

Gates Foundation, and me: Mandate preprints, support peer review services outside of big publishers

Today the Gates Foundation announced that they will “cease support for individual article publishing fees, known as APCs, and mandate the use of preprints while advocating for their review”. I am excited by this news because over the last couple of decades, it’s been disheartening to see large funders continue to pour money down the throats of high-profit multinational publishers.

In their announcement, the Gates Foundation has recommendations for research funders that include the following:

Invest funding into models that benefit the whole ecosystem and not individual funded researchers.

They also state that funders, and researchers, should support innovative initiatives that facilitate peer review and curation separately from traditional publication.

Diamond OA journals, which are free to authors as well as readers, clearly fit the bill, as well as journal-independent review services such as Peer Community In, PreReview, and COAR-Notify. I’m an (unpaid) advisory board member of the Free Journal Network, which supports (and does some light vetting of) diamond OA journals. I’m also an associate editor at the free WikiJournal of Science, Meta-Psychology, and the coming Meta-ROR metascience peer review platform. All of these initiatives are oriented around providing free peer review of preprints.

Such initiatives have had trouble attracting funding, as have preprint servers, despite the enormous benefit preprint servers have provided: rapid dissemination of research, much faster than through journals.

Because of how agreements like Germany’s DEAL (and Australia’s planned deal) facilitate publisher lock-in, my favorite episodes in the history of such negotiations are the extended periods when German and Californian universities did not have access to Elsevier publications, pushing them away from Elsevier rather than toward it. As Björn Brembs and I wrote in 2017, the best DEAL is no deal. When funders have an agreement with them, researchers are unfortunately pushed toward high-profit, progress-undermining publishers like Elsevier, as in that case publishing with Elsevier is free, while it may not be with more progressive and lower-cost publishers. And as an Australian colleague was quoted saying, the proposed agreement with Elsevier would “enshrine a national debt to wealthy international publishers, who were likely to tack on hefty increases once an agreement was reached.”

An executive summary of science’s replication crisis

To evaluate and build on previous findings, a researcher sometimes needs to know exactly what was done before.

Computational reproducibility is the ability to take the raw data from a study and re-analyze it to reproduce the final results, including the statistics.

Empirical reproducibility is demonstrated when the study is done again by another team and the critical results reported by the original team are found again.

Poor computational reproducibility

Economics. Reinhart and Rogoff, two respected Harvard economists, reported in a 2010 paper that growth slows when a country’s debt rises to more than 90% of GDP. Austerity backers in the UK and elsewhere invoked this many times. A postgrad failed to replicate the result, and Reinhart and Rogoff sent him their Excel file. They had unwittingly failed to select the entire list of countries as input to one of their formulas. Fixing this diminished the reported effect, and using a variant of the original method yielded the opposite of the result that had been used to justify billions of dollars’ worth of national budget decisions.
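As a toy illustration (invented numbers, not the actual Reinhart–Rogoff dataset), here is how an aggregation over an incomplete range silently changes a headline figure:

```python
# Toy example: a formula applied to a truncated range quietly shifts the result.
from statistics import mean

growth_by_country = {            # invented numbers for illustration only
    "Australia": 3.8, "Austria": 1.9, "Belgium": 2.6, "Canada": 2.2,
    "Denmark": 3.1, "Finland": 2.4, "France": 2.3, "Greece": -0.5,
    "Ireland": -1.2, "Italy": 0.4,
}

all_values = list(growth_by_country.values())
truncated = all_values[:-3]      # the formula's range stops three rows short

print(f"Mean growth, all countries:   {mean(all_values):.2f}%")
print(f"Mean growth, truncated range: {mean(truncated):.2f}%")
```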

A systematic study found that only about 55% of studies could be reproduced, and that’s only counting studies for which the raw data were available (Vilhuber, 2018).

Cancer biology. The Reproducibility Project: Cancer Biology found that for none of 51 papers could a full replication protocol be designed without input from the original authors (Errington, 2019).

Not sharing data or analysis code is common. Ioannidis and colleagues (2009) could only reproduce about 2 out of 18 microarray-based gene-expression studies, mostly due to lack of complete data sharing.

Artificial intelligence (machine learning). A survey of reinforcement learning papers found only about 50% included code, and in a study of publications associated with neural net recommender systems, only 40% were found to be reproducible (Barber, 2019).

Poor empirical reproducibility

Wet-lab biology.  Amgen researchers were shocked when they were only able to replicate 11% of 53 landmark studies in oncology and hematology (Begley and Ellis, 2012).

“I explained that we re-did their experiment 50 times and never got their result. He said they’d done it six times and got this result once, but put it in the paper because it made the best story.” – Begley

A Bayer team reported that ~25% of published preclinical studies could be validated to the point at which projects could continue (Prinz et al., 2011). Hampered by poor computational reproducibility and methods sharing, the most careful effort so far (Errington, 2013) decided that only 18 of 50 high-impact cancer biology studies could be fully attempted; it has finished only 14, of which 9 are partial or full successes.

From Maki Naro’s 2016 cartoon.

Social sciences

62% of 21 social science experiments published in Science and Nature between 2010 and 2015 replicated, using samples on average five times bigger than the original studies to increase statistical power (Camerer et al., 2018).

61% of 18 laboratory economics experiments successfully replicated (Camerer et al., 2016).

39% of 100 experimental and correlational psychology studies replicated (Nosek et al., 2015).

53% of 51 other psychology studies replicated (Klein et al., 2018; Ebersole et al., 2016; Klein et al., 2014), and ~50% of 176 other psychology studies (Boyce et al., 2023).

Medicine

Trials: data for >50% never made available, ~50% of outcomes not reported, and authors’ data lost at ~7% per year (DeVito et al., 2020).

I list six of the causes of this sad state of affairs in another post.

References

Barber, G. (n.d.). Artificial Intelligence Confronts a “Reproducibility” Crisis. Wired. Retrieved January 23, 2020, from https://www.wired.com/story/artificial-intelligence-confronts-reproducibility-crisis/

Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature, 483(7391), 531–533.

Boyce, V., Mathur, M., & Frank, M. C. (2023). Eleven years of student replication projects provide evidence on the correlates of replicability in psychology. PsyArXiv. https://doi.org/10.31234/osf.io/dpyn6

Bush, M., Holcombe, A. O., Wintle, B. C., Fidler, F., & Vazire, S. (2019). Real problem, wrong solution: Why the Nationals shouldn’t politicise the science replication crisis. The Conversation. https://theconversation.com/real-problem-wrong-solution-why-the-nationals-shouldnt-politicise-the-science-replication-crisis-124076

Camerer, C. F., et al.,  (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z

Camerer, C. F., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351, 1433–1436. https://doi.org/10.1126/science.aaf0918

DeVito, N. J., Bacon, S., & Goldacre, B. (2020). Compliance with legal requirement to report clinical trial results on ClinicalTrials.gov: A cohort study. The Lancet, 0(0). https://doi.org/10.1016/S0140-6736(19)33220-9

Ferrari Dacrema, M., Cremonesi, P., & Jannach, D. (2019). Are we really making much progress? A worrying analysis of recent neural recommendation approaches. Proceedings of the 13th ACM Conference on Recommender Systems, 101–109. https://doi.org/10.1145/3298689.3347058

Ebersole, C. R., et al. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. https://doi.org/10.1016/j.jesp.2015.10.012

Errington, T. (2019) https://twitter.com/fidlerfm/status/1169723956665806848

Errington, T. M., Iorns, E., Gunn, W., Tan, F. E., Lomax, J., & Nosek, B. A. (2014). An open investigation of the reproducibility of cancer biology research. ELife, 3, e04333. https://doi.org/10.7554/eLife.04333

Errington, T. (2013). https://osf.io/e81xl/wiki/home/

Glasziou, P., et al. (2014). Reducing waste from incomplete or unusable reports of biomedical research. The Lancet, 383(9913), 267–276. https://doi.org/10.1016/S0140-6736(13)62228-X

Ioannidis, J. P. A., Allison, D. B., et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics, 41(2), 149–155. https://doi.org/10.1038/ng.295

Klein, R. A., et al. (2018). Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225

Klein, R. A., et al. (2014). Investigating Variation in Replicability. Social Psychology, 45(3), 142–152. https://doi.org/10.1027/1864-9335/a000178

Nosek, B. A., Aarts, A. A., Anderson, C. J., Anderson, J. E., Kappes, H. B., & the Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712.

Vilhuber, L. (2018). Reproducibility and replicability in economics. https://www.nap.edu/resource/25303/Reproducibility%20in%20Economics.pdf

A legacy of skepticism and universalism

Many of the practices associated with modern science emerged in the early days of the Royal Society of London for Improving Natural Knowledge, which was founded in 1660. Today, it is usually referred to as simply “the Royal Society”. When the Royal Society chose a coat of arms, they included the words Nullius in verba.

Nullius in verba is usually taken to mean “Take nobody’s word for it”, which was a big departure from tradition. People previously had mostly been told to take certain things completely on faith, such as the proclamations of the clergy and even the writings of Aristotle. 

In the early 1600s, René Descartes had written a book urging people to be skeptical of what others claim, no matter who they are.

Rene Descartes. Image: public domain.

This caught on in France, even among the public — many people started referring to themselves as Cartesians. Meanwhile in Britain, the ideas of Francis Bacon were becoming influential. His skepticism was less radical than Descartes’, and included many practical suggestions for how knowledge could be advanced.

Francis Bacon in 1616. Image: public domain.

Bacon’s mix of skepticism with optimism about advancing knowledge using observation led, in 1660, to the founding in London of “a Colledge for the Promoting of Physico-Mathematicall Experimentall Learning”. This became the Royal Society.

The combination of skepticism and the opening up of knowledge advancement to contemporary people, not just traditional authorities, set the stage for the success of modern science. When multiple skeptical researchers take a close look at the evidence behind a new claim and are unable to find major problems with the evidence, everyone can then be more confident in the claim. As the historian David Wootton has put it, “What marks out modern science is not the conduct of experiments”, but rather “the formation of a critical community capable of assessing discoveries and replicating results.”

Taking the disregard of traditional authority further, in the 20th century the sociologist Robert Merton suggested that scientists value universalism. By universalism, Merton meant that in science, claims are evaluated without regard to the sort of person providing the evidence. Evidence is evaluated by scientists, Merton wrote, based on “pre-established impersonal criteria”. 

Universalism provides a vision of science that is egalitarian, and universalism is endorsed by large majorities of today’s scientists. However, those who endorse it don’t always follow it in practice. Scientific organizations such as the Royal Society can be elitist. For example, sometimes the scholarly journals that societies publish treat reports by famous researchers with greater deference than those by other researchers.

Placing some trust in authorities (such as famous researchers) is almost unavoidable, because in life we have to make decisions about what to do even when we can’t be completely certain of the facts. In such situations, it can be appropriate to “trust” authorities, believing their proclamations. We don’t have the resources to assess all kinds of scientific evidence ourselves, so we have to look to those who seem to have a track record of making well-justified claims in a particular area. But when it comes to the development of new, cutting-edge knowledge, science thrives on the skepticism that drives the behavior of some researchers.

Together, the values of communalism, skepticism, and a mixture of universalism and elitism shaped the growth of scientific institutions, including the main way in which researchers officially communicated their findings: through academic journals.

Introduction to reproducibility

A brief intro for research students.

Good science is cumulative

Scientists seek to create new knowledge, often by conducting an experiment or other research study.

But science is much more than doing studies and analyzing the data. Critical to the scientific enterprise is communication of what was done, and what was found.

Isaac Newton, painted by Godfrey Kneller in 1689. Image: public domain.

Isaac Newton, who formulated the laws of motion and gravity, wrote that “If I have seen further it is by standing on the shoulders of giants.” Newton knew that science is cumulative – we build on the findings of previous researchers.

Robert Merton, a sociologist of science, described values or norms that are endorsed by many scientists. One of these that is critical to ensuring that science is cumulative is the norm of communalism. Communalism refers to the notion that scientific methodologies and results are not the property of individuals, but rather should be shared with the world.

Sharing allows others to know the details of a previous study, which is important for:

  1. Understanding the study’s results

  2. Building on its methodology

  3. Confirming its findings

This last purpose is, arguably, essential. But across diverse sciences, ensuring that confirmation can be done, as well as actually doing it, has been neglected. This is the issue of reproducibility.

Reproducibility

Another scientific norm important to achieving reproducibility was dubbed organized skepticism by Merton. The critical community provided by other researchers is thought to be key to the success of science. The Royal Society of London for Improving Natural Knowledge, more commonly known as simply the Royal Society, was founded in 1660 and established many of the practices we today associate with science worldwide. The Latin motto of the Royal Society, “Nullius in verba”, is often translated as “Take nobody’s word for it”. 

Anyone can make a mistake, and most or all of us have biases, so scientific claims should be verifiable. The historian of science David Wootton has written that “What marks out modern science is not the conduct of experiments”, but rather “the formation of a critical community capable of assessing discoveries and replicating results.”

Types of reproducibility

Assessing discoveries and replicating results can involve two distinct types of activities. One involves examining the records associated with a study to check for errors. The second involves attempting to re-do the study to see whether it yields similar data that support the original claim.

The first type of activity is often referred to today with the phrase computational reproducibility. The word “computational” refers to taking the raw observations or data recorded for a study and re-doing the analysis that most directly supports the claims made.

The second activity is often referred to as replication. If a study is redone by collecting new data, does this replication study yield similar results to the original study? If very different results are obtained, this may call into question the claim of the original study, or indicate that the original study is not one that can be easily built upon.

Sometimes the word empirical is put in front of reproducibility or replication to make it clear that new data are being collected, rather than referring to computational reproducibility.
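A toy sketch, with hypothetical data files and column names, may make the distinction concrete: the same analysis code is re-run either on the original data (computational reproducibility) or on newly collected data (replication).

```python
# Toy illustration of the two senses of "reproducibility"; file names are hypothetical.
import pandas as pd
from scipy import stats

def analysis(csv_path: str):
    """The analysis pipeline that most directly supports the paper's claim."""
    d = pd.read_csv(csv_path)
    return stats.ttest_ind(d.loc[d["group"] == "A", "score"],
                           d.loc[d["group"] == "B", "score"])

# Computational reproducibility: re-run the original analysis on the ORIGINAL data
# and check that the reported statistics come out the same.
print(analysis("original_study_data.csv"))

# Replication (empirical reproducibility): collect NEW data with the same method,
# run the same analysis, and see whether the result supports the original claim.
print(analysis("replication_study_data.csv"))
```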

The replication crisis

The importance of reproducibility, in principle, has been recognized as critical throughout the history of science. In practice, however, many sciences have failed to adequately incentivize replication. This is one reason (later we will describe others) for the replication crisis.

The replication crisis refers to the discovery of, and subsequent reckoning with, the poor success rates of efforts to computationally reproduce or empirically replicate previous studies.

Survey results on the replication crisis.

The credibility revolution

The credibility revolution refers to the efforts by individual researchers, societies, scientific journals, and research funders to improve reproducibility. This has led to:

  1. New best practices for doing individual studies

  2. Changes in how researchers and their funding applications are evaluated

  3. Greater understanding of how to evaluate the credibility of the claims of individual studies

The word credibility refers to how believable a theory or claim is. This reflects both how plausible the claim is before one hears of any relevant evidence and the strength of the evidence for it. Thus if a claim is highly credible, the probability that it is true is high. The phrase credibility revolution helps convey that reforms related to reproducibility have boosted the credibility of many scientific theories and claims.
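One way to make this explicit, my gloss rather than part of any formal definition, is Bayes’ rule: the credibility of a claim after seeing the evidence combines its prior plausibility with how strongly the evidence favors it.

```latex
P(\mathrm{claim} \mid \mathrm{evidence}) \;=\;
  \frac{P(\mathrm{evidence} \mid \mathrm{claim}) \, P(\mathrm{claim})}{P(\mathrm{evidence})}
```

Here P(claim) is the prior plausibility, and the likelihood term captures how well the evidence fits the claim.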

Just a list of our VSS presentations for 2019

Topics this year: Visual letter processing; role of attention shifts versus buffering (mostly @cludowici, @bradpwyble); reproducibility (@sharoz); visual working memory (mostly @will_ngiam)

Symposium contribution, 12pm Friday 17 May:  Reading as a visual act: Recognition of visual letter symbols in the mind and brain

Implicit reading direction and limited-capacity letter identification

ebmocloH xelA, The University of Sydney
(abstract now has better wording)
I would like to congratulate you for reading this sentence. Somehow you dealt with a severe restriction on simultaneous identification of multiple objects – according to the influential “EZ reader” model of reading, humans can identify only one word at a time. Reading text apparently involves a highly stereotyped attentional routine with rapid identification of individual stimuli, or very small groups of stimuli, from left to right. My collaborators and I have found evidence that this reading routine is elicited when just two widely-spaced letters are briefly presented and observers are asked to identify both letters. A large left-side performance advantage manifests, one that is absent or reversed when the two letters are rotated to face to the left instead of to the right. Additional findings from RSVP (rapid serial visual presentation) lead us to suggest that both letters are attentionally selected simultaneously, with the bottleneck at which one letter is prioritized sited at a late stage – likely at an identification or working memory consolidation process. Thus, a rather minimal cue of letter orientation elicits a strong reading direction-based prioritization routine. Our ongoing work seeks to exploit this to gain additional insights into the nature of the bottleneck in visual identification and how reading overcomes it.

 

Is there a reproducibility crisis around here? Maybe not, but we still need to change.

Alex O Holcombe1, Charles Ludowici1, Steve Haroz2

Poster 2:45pm Sat 18 May

1School of Psychology, The University of Sydney

2Inria, Saclay, France

Those of us who study large effects may believe ourselves to be unaffected by the reproducibility problems that plague other areas. However, we will argue that initiatives to address the reproducibility crisis, such as preregistration and data sharing, are worth adopting even under optimistic scenarios of high rates of replication success. We searched the text of articles published in the Journal of Vision from January through October of 2018 for URLs (our code is here: https://osf.io/cv6ed/) and examined them for raw data, experiment code, analysis code, and preregistrations. We also reviewed the articles’ supplemental material. Of the 165 articles, approximately 12% provide raw data, 4% provide experiment code, and 5% provide analysis code. Only one article contained a preregistration. When feasible, preregistration is important because p-values are not interpretable unless the number of comparisons performed is known, and selective reporting appears to be common across fields. In the absence of preregistration, then, and in the context of the low rates of successful replication found across multiple fields, many claims in vision science are shrouded by uncertain credence. Sharing de-identified data, experiment code, and data analysis code not only increases credibility and ameliorates the negative impact of errors, it also accelerates science. Open practices allow researchers to build on others’ work more quickly and with more confidence. Given our results and the broader context of concern by funders, evident in the recent NSF statement that “transparency is a necessary condition when designing scientifically valid research” and “pre-registration… can help ensure the integrity and transparency of the proposed research”, there is much to discuss.

 

Talk Saturday 18 May 2.30pm

A delay in sampling information from temporally autocorrelated visual stimuli
Chloe Callahan-Flintoft1, Alex O Holcombe2, Brad Wyble1
1Pennsylvania State University
2University of Sydney
Understanding when the attentional system samples from continuously changing input is important for understanding how we build an internal representation of our surroundings. Previous work looking at the latency of information extraction has found conflicting results. In paradigms where features such as color change continuously and smoothly, the color selected in response to a cue can be one presented as much as 400 ms after the cue (Sheth, Nijhawan, & Shimojo, 2000). Conversely, when discrete stimuli such as letters are presented sequentially at the same location, researchers find selection latencies under 25 ms (Goodbourn & Holcombe, 2015). The current work proposes an “attentional drag” theory to account for this discrepancy. This theory, which has been implemented as a computational model, proposes that when attention is deployed in response to a cue, smoothly changing features temporally extend attentional engagement at that location whereas a sudden change causes rapid disengagement. The prolonged duration of attentional engagement in the smooth condition yields longer latencies in selecting feature information.
In three experiments participants monitored two changing color disks (changing smoothly or pseudo-randomly). A cue (white circle) flashed around one of the disks. The disks continued to change color for another 800 ms. Participants reported the disk’s perceived color at the time of the cue using a continuous scale. Experiment 1 found that when the color changed smoothly there was a larger selection latency than when the disk’s color changed randomly (112 vs. 2 ms). Experiment 2 found this lag increased with an increase in smoothness (133 vs. 165 ms). Finally, Experiment 3 found that this later selection latency is seen when the color changes smoothly after the cue but not when the smoothness occurs only before the cue, which is consistent with our theory.

 

Poster 2pm 20 May

Examining the effects of memory compression with the contralateral delay activity
William X Ngiam1,2, Edward Awh2, Alex O Holcombe1
1School of Psychology, University of Sydney
2Department of Psychology, University of Chicago
While visual working memory (VWM) is limited in the amount of information that it can maintain, it has been found that observers can overcome the usual limit using associative learning. For example, Brady et al. (2009) found that observers showed improved recall of colors that were consistently paired together during the experiment. One interpretation of this finding is that statistical regularities enable subjects to store a larger number of individuated colors in VWM. Alternatively, it is also possible that performance in the VWM task was improved via the recruitment of LTM representations of well-learned color pairs. In the present work, we examine the impact of statistical regularities on contralateral delay activity (CDA) that past work has shown to index the number of individuated representations in VWM. Participants were given a bilateral color recall task with a set size of either two or four. Participants also completed blocks with a set size of four where they were informed that colors would be presented in pairs and shown which pairs would appear throughout, to encourage chunking of the pairs. We find this explicit encouragement of chunking improved memory recall but that the amplitude of the CDA was similar to the unpaired condition. Xie and Zhang (2017; 2018) previously found evidence that familiarity produces a faster rate of encoding as indexed by the CDA at an early time window, but no difference at a late time window. Using the same analyses on the present data, we instead find no differences in the early CDA, but differences in the late CDA. This result raises interesting questions about the interaction between the retrieval of LTM representations and what the CDA is indexing.

 

Poster Tues 21 May 2:45pm

Selection from concurrent RSVP streams: attention shift or buffer read-out?
Charles J H Ludowici, Alex O. Holcombe
School of Psychology, The University of Sydney, Australia
Selection from a stream of visual information can be elicited via the appearance of a cue. Cues are thought to trigger a time-consuming deployment of attention that results in selection for report of an object from the stream. However, recent work using rapid serial visual presentation (RSVP) of letters finds reports of letters just before the cue at a higher rate than is explainable by guessing. This suggests the presence of a brief memory store that persists rather than being overwritten by the next stimulus. Here, we report experiments investigating the use of this buffer and its capacity. We manipulated the number of RSVP streams from 2 to 18, cued one at a random time, and used model-based analyses to detect the presence of attention shifts or buffered responses. The rate of guessing does not seem to change with the number of streams. There are, however, changes in the timing of selection. With more streams, the stimuli reported are later and less variable in time, decreasing the proportion reported from before the cue. With two streams – the smallest number of streams tested – about a quarter of non-guess responses come from before the cue. This proportion drops to 5% in the 18 streams condition. We conclude that it is unlikely that participants are using the buffer when there are many streams, because of the low proportion of non-guesses from before the cue. Instead, participants must rely on attention shifts.