Tag: Sequence Read Archive (SRA)

Coming Soon! Including Sample Location and Collection Date and Time for Sequences Submitted to GenBank and SRA

As previously announced, in collaboration with our partners at the International Nucleotide Sequence Database Collaboration (INSDC), we will begin to systematically gather ‘location of collection’ and ‘date and time of collection’ for sequence data submitted to GenBank and the Sequence Read Archive (SRA). Gathering information about where and when a biological sample was collected aligns with other global sequence submission standardization efforts and will increase the utility of data made available through GenBank and SRA. These changes will be implemented in a phased approach through December 2024.

What’s new?

Sequence data submitted to GenBank and the SRA will need to include information about location and date and time of sample collection. These metadata will be entered using the pre-existing fields ‘country’ and ‘collection_date’. Minimum information for these fields is described below. We encourage submitters to provide additional details when available.

Streamlining Access to SRA COVID-19 Datasets on the Cloud

To make it easier for you to find and access Sequence Read Archive (SRA) data, we are re-organizing and improving our cloud storage systems.  

Beginning April 2023, we will move the SARS-CoV-2 normalized data and source files from the COVID-19 data buckets on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to the NIH NCBI SRA on AWS registry. We will also remove the SARS-CoV-2 original format data from AWS and GCP COVID-19 buckets and make them available in AWS cold storage. If you need these data, you can request them using the Cloud Data Delivery Service (CDDS). 

Where and how will I be able to access SARS-CoV-2 normalized data after this change occurs?

To ensure a smooth transition, we want you to have enough time to adjust your scripts and pipelines to minimize disruption to your analyses.

3+ Ways NCBI is Enhancing the SRA Database

Do you submit or access Sequence Read Archive (SRA) data? In an ongoing effort to enhance your experience, NCBI is making several improvements to our widely used SRA database. SRA is the largest publicly available repository of high-throughput sequencing data. The archive accepts data from all organisms as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enable reproducibility and facilitate new discoveries through data analysis.

What improvements is NCBI making?

  • More transparent: We recently launched the GenBank and SRA Data processing page to help you better understand how sequence data are submitted, processed, and made publicly available. 
  • More efficient: Faster data transfers, downloads, and analyses! We will be incrementally streamlining how you access SRA data as SRA Lite becomes the standard SRA file format. This simplified format reduces the average file size for more efficient analysis and storage of large datasets. 
  • More reliable: A trusted source! SRA is a trustworthy database, and we are continuously improving our processes to ensure system reliability.   
  • And more!  

Scrubbing human sequence contamination from Sequence Read Archive (SRA) submissions

Do you work with human-derived sequence data? Do you struggle to determine whether your data are free of human sequence and therefore suitable for public distribution? We encourage submitters to screen for and remove contaminating human reads from data files before submitting to SRA. To support investigators in this effort, we offer a tool that removes human sequence contamination from your SRA submissions!

Human Read Removal Tool (HRRT)

The Human Read Removal Tool (HRRT; also known as the Human Scrubber) is available on GitHub and DockerHub. The HRRT is based on the SRA Taxonomy Analysis Tool (STAT). It takes a fastq file as input and produces a fastq.clean file as output, in which all reads identified as potentially of human origin are masked with ‘N’.
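
To make the masking behavior concrete, here is a minimal sketch of what "masked with ‘N’" means for a fastq record. This is not the HRRT implementation: the real tool decides which reads look human using STAT, whereas this example simply assumes you already have a set of flagged read IDs. The function name `mask_reads` and the `flagged_ids` argument are hypothetical, and the sketch assumes simple four-line fastq records.

```python
from typing import Iterable, Iterator, Set


def mask_reads(fastq_lines: Iterable[str], flagged_ids: Set[str]) -> Iterator[str]:
    """Yield fastq lines, replacing the bases of flagged reads with 'N'.

    Assumes four-line records: header, sequence, '+' separator, quality.
    `flagged_ids` is a hypothetical stand-in for the reads a screening
    tool has identified; it is not produced by this function.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            # Read ID: text after '@' up to the first whitespace.
            parts = header[1:].split()
            read_id = parts[0] if parts else ""
            if read_id in flagged_ids:
                # Keep the read length and quality string; hide the bases.
                seq = "N" * len(seq)
            yield from (header, seq, plus, qual)
            record = []


example = [
    "@read1 sample", "ACGT", "+", "IIII",
    "@read2", "GGCC", "+", "IIII",
]
masked = list(mask_reads(example, {"read2"}))
```

Masking rather than deleting keeps the record count and read lengths unchanged, so paired files stay in step.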

Announcing the GenBank and SRA Data Processing Webpage

Interested in understanding how sequence data are submitted, processed, and made publicly available in GenBank and the Sequence Read Archive (SRA)? Announcing the GenBank and SRA Data Processing webpage!

Here you can learn about procedures that the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), uses for processing submitted data and public posting, as well as key definitions of data status.

dbGaP: Data and analyses from millions of study participants, samples, and trillions of genotypes!

Are you familiar with the well-known Framingham Heart Study, a multi-generation study of residents of Framingham, Massachusetts, begun in 1948? Much of what is now known about the impact of genetics, lifestyle, and diet on cardiovascular health and disease has come from this research study. (See PMC4159698 for a historical perspective.) Did you know that data from this study and over 2,000 other studies that demonstrate the relationship between genetic and medical outcomes and other phenotypes are available from NCBI’s Database of Genotypes and Phenotypes (dbGaP)?

dbGaP was established in 2007 as a repository of human data from large-scale studies. You can access data from more than 2.8 million study participants who have provided over 3.3 million molecular samples. You can retrieve patient-level phenotypic (e.g., demographic, clinical, exposure) data and molecular (e.g., called genotypes, omics, sequence) data, and the results of association analyses from genome-scale case-control and longitudinal studies of heritable diseases.

What types of studies and data are available in dbGaP?

dbGaP contains a wide range of studies and types of data, all relating to human genetic and phenotypic measurements. Most dbGaP data are from NIH-funded research, but recently we have expanded to include non-NIH funded studies. An easy way to find dbGaP Studies, Phenotype and Molecular Datasets, Variables, Analyses and Documents is through the dbGaP Advanced Search (Figure 1). The interface allows you to filter results by different characteristics depending on the tab you choose.

Figure 1. The dbGaP Advanced Search interface. Tabs that appear at the top of the web interface allow you to select the studies, datasets, analyses, etc. of interest. Filters (facets) appear on the left (see inset). Click on filters to select values. Links on the study summary pages provide direct access to data. Top panel: Studies tab and the corresponding filter categories. Bottom panel: Molecular data tab results with Study (Framingham SHARe) and Markerset Source (Affymetrix) filters applied.

Monkeypox virus: Complete genome from the current outbreak now available in GenBank

The first complete genome sequence of the current monkeypox virus (MPXV) outbreak (isolate name MPXV_USA_2022_MA001) is now available with accession ON563414 in GenBank, a public database of DNA sequences hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM).

Several cases of monkeypox have been identified in geographically widespread countries. Monkeypox is classified as a zoonotic disease in which transmission of the virus is usually due to animal-human contact. Genetically, monkeypox viruses cluster into two groups: the Congo Basin clade and the West African clade. This particular outbreak has been attributed to a virus from the West African clade, which is often associated with milder disease; in this case, human-to-human spread is suspected.

Introducing SARS-CoV-2 Variants Overview, NLM’s latest tool in the fight against COVID-19 

The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) has released a new resource, called the SARS-CoV-2 Variants Overview, that aggregates data related to SARS-CoV-2 variants from sequences available in NCBI’s GenBank and Sequence Read Archive (SRA) databases.

SARS-CoV-2 Variants Overview, a freely available online dashboard, was developed with guidance from the TRACE Working Group as part of NLM’s participation in the National Institutes of Health (NIH) Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV) initiative, a public-private partnership for a coordinated research strategy to support and speed up the development of COVID-19 treatments and vaccines.

One impetus for development of the dashboard is that unassembled SRA data cannot be processed through Pango tools, and many SARS-CoV-2 samples are only represented in SRA. The Pango nomenclature is being used by researchers and public health agencies worldwide to track the transmission and spread of SARS-CoV-2, including variants of concern. Thus, we developed a uniform approach to making variant calls from SRA records and assigning Pangolin lineages on the basis of these results. This means that submission groups do not have to go through the effort of creating assemblies.

BLAST+ 2.13.0 now available with SRA BLAST, ARM Linux executables, and database metadata

BLAST+ 2.13.0 includes several important new features: SRA BLAST programs, ARM Linux executables, and the ability to produce database metadata. It also includes some important improvements and a few bug fixes. You can download the new BLAST release from the FTP site.

New features

SRA / WGS BLAST (blastn_vdb, tblastn_vdb)

Beginning with this release, the BLAST distribution includes the SRA BLAST programs blastn_vdb and tblastn_vdb, which can directly search SRA and WGS projects without the need to build a BLAST database. See the BLAST documentation on how to use these programs with WGS projects.

ARM Linux executables

This release also includes executables compiled under ARM Linux for the first time. Please let us know if you find any issues with ARM Linux programs.

Database metadata in JSON format

Starting with BLAST+ 2.13.0, the makeblastdb program generates an additional file with the file extension .njs for nucleotide databases or .pjs for protein databases. These files contain BLAST database metadata in JSON format. See the BLAST database metadata section in the BLAST User Manual for an example. Because the files are plain JSON, many tools can read them easily, which makes BLAST databases more compliant with FAIR principles.
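
As a rough illustration of how a script might read one of these metadata files, the sketch below loads it with Python's standard json module and infers the database type from the file extension described above. The function name `load_blast_metadata` is hypothetical, and the sketch deliberately makes no assumptions about which keys the file contains; consult the BLAST User Manual for the actual fields.

```python
import json
from pathlib import Path

# Extension-to-type mapping taken from the description above.
DB_TYPE_BY_SUFFIX = {".njs": "nucleotide", ".pjs": "protein"}


def load_blast_metadata(path):
    """Return (db_type, metadata) for a BLAST .njs or .pjs metadata file.

    `metadata` is whatever JSON object the file holds; no particular
    keys are assumed here.
    """
    path = Path(path)
    db_type = DB_TYPE_BY_SUFFIX.get(path.suffix)
    if db_type is None:
        raise ValueError(f"expected a .njs or .pjs file, got {path.name!r}")
    with path.open(encoding="utf-8") as handle:
        metadata = json.load(handle)
    return db_type, metadata
```

Calling `load_blast_metadata("mydb.njs")` (a hypothetical file name) would return `"nucleotide"` together with the parsed metadata, which you can then inspect with `metadata.keys()`.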

See the release notes for more details on improvements and bug fixes for the release.

Important reminder about usage reporting

As we announced previously, BLAST can report limited usage information back to NCBI. This information shows us whether the community is using BLAST+, and therefore whether it is worth maintaining and developing. It also allows us to focus our development efforts on the most used aspects of BLAST+. Please help us improve BLAST by allowing it to share information about your search. The BLAST privacy statement provides details on the information collected, how it is used, and how to opt out of reporting if you don’t want to participate.

NCBI Trace database to be retired in June 2022. Data available in SRA.

The Trace Archive at NCBI will be retired as of June 17, 2022. After the retirement, you may continue to retrieve Trace Archive content by searching the Sequence Read Archive (SRA) using TI number, organism, or center name.
