The volume of biological data being generated by the scientific community is growing exponentially, reflecting technological advances and research activities. This increase in available data holds great promise for advancing scientific discovery but also introduces new challenges that scientific communities need to address. The National Institutes of Health’s (NIH) Sequence Read Archive (SRA), which is maintained by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), is a rapidly growing public database that researchers use to improve scientific discovery across all domains of life. As part of the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, over 36 petabytes of “next generation” (raw and SRA-formatted) sequencing data is accessible to anybody via two cloud service providers.
To help address the challenges of conducting large-scale analysis of -omic data in the SRA and similar databases, the Department of Energy (DOE) Office of Biological and Environmental Research (BER), the NIH Office of Data Science Strategy (ODSS), and NCBI held a virtual workshop on June 8, 2021, on Emerging Solutions in Petabyte Scale Sequence Search. The workshop brought together experts from DOE national labs, research institutions, and universities across the world.
SRA data growth over time. Databases like the NIH Sequence Read Archive are growing rapidly and are used extensively by scientific communities. As these databases grow, so does their potential scientific value, but work must be done to ensure ease of access.
This interagency workshop began with leaders from the NIH and DOE framing the impact of large-scale analysis on fundamental research and human health. Explaining the impetus for the workshop, Dr. Susan Gregurick, NIH Associate Director for Data Science and ODSS Director, said: “We all share a common problem and a need to develop, enhance, and implement methods that streamline data access, search or findability, and ultimately data reuse.” Dr. Todd Anderson, the Biological Systems Science Division Director from the DOE, added that “there is much to be gained from employing big data technology to assist with experimentation in biological sciences,” while welcoming attendees and their ideas on large-scale analysis of -omic data.
The workshop continued with presentations that described the current state of the art in metagenomic sequence search, including challenges in searching petabyte-scale sequence data, artificial intelligence and supercomputing for data analytics, transforming data to wisdom, and using petabyte-scale search to map microbes.
Attendees also exchanged expertise and opinions in breakout rooms dedicated to the topics of challenges in scalable computing, analyzing large metagenomic datasets, and bottlenecks to current approaches in sequence search. One attendee explained that “open discourse on genomic data storage and computing was very productive,” and continued, “I hope our time together (at the workshop) will help improve accessibility and findability of data,” which is the long-term goal of this endeavor. The workshop culminated in a manuscript outline to report on the challenges and approaches that were discussed and prioritized.
To continue these efforts and begin tackling challenges in searching petabyte-scale sequence data, we are teaming up with the ODSS and the DOE BER to host the Petabyte Scale Sequence Search: Metagenomics Benchmarking Codeathon on Sept. 27-Oct. 1, from 1-5 p.m. EDT each day. Apply by August 20. If you have any questions about this codeathon, please reach out to the NCBI codeathon team at [email protected].
For any queries related to the petabyte scale sequence search initiative, please contact [email protected].