Tag: Cloud computing

Changes to SRA Data Access on Amazon Web Services (AWS) and Google Cloud Platform (GCP)

Important note! The storage tier for Sequence Read Archive (SRA) data in Amazon Web Services (AWS) commercial buckets is transitioning to Glacier Instant Retrieval, and the storage tier for SRA data on Google Cloud Platform (GCP) is transitioning to Coldline. This change is projected to be complete by the end of October 2024. To mitigate the cost impact of this change, we recommend adjusting your data access workflow to use the SRA Toolkit when accessing SRA data from AWS or GCP.

Please note this change does not impact SRA data access from NCBI servers or AWS Open Data Program.
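As a rough illustration of a Toolkit-based workflow, the sketch below builds the command lines for the SRA Toolkit's `prefetch` and `fasterq-dump` programs for a single run. The accession `SRR000001` and the output directory name are placeholders chosen for this example, and the sketch only prints the commands rather than running them. Which storage location the Toolkit reads from depends on how it is configured, which is not shown here.

```python
import shlex


def sra_toolkit_commands(accession, outdir="fastq"):
    """Return command lines that download one run and convert it to FASTQ.

    Assumes the SRA Toolkit programs `prefetch` and `fasterq-dump`
    are installed and on PATH.
    """
    # prefetch downloads the run into a directory named after the accession
    prefetch = ["prefetch", accession]
    # fasterq-dump converts the downloaded run to FASTQ files in `outdir`
    dump = ["fasterq-dump", accession, "--outdir", outdir]
    return [prefetch, dump]


if __name__ == "__main__":
    # "SRR000001" is only a placeholder accession for illustration
    for cmd in sra_toolkit_commands("SRR000001"):
        print(shlex.join(cmd))
```

To actually run the commands, each list could be passed to `subprocess.run(cmd, check=True)`.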

Changes to SRA Data Access on Amazon Web Services (AWS)

Cost-effective alternatives for accessing SRA data  

Important note! The storage tier for Sequence Read Archive (SRA) data available through Amazon Web Services (AWS) commercial buckets is transitioning to Infrequent Access. This change is projected to be complete by the end of September 2024. To mitigate the cost impact of this change, we recommend adjusting your data access workflow to use the SRA Toolkit for accessing SRA data.

Please note this change does not impact SRA data access from Google Cloud Platform (GCP) or NCBI servers.

Streamlining Access to SRA COVID-19 Datasets on the Cloud

To make it easier for you to find and access Sequence Read Archive (SRA) data, we are re-organizing and improving our cloud storage systems.  

Beginning April 2023, we will move the SARS-CoV-2 normalized data and source files from the COVID-19 data buckets on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to the NIH NCBI SRA on AWS registry. We will also remove the SARS-CoV-2 original format data from AWS and GCP COVID-19 buckets and make them available in AWS cold storage. If you need these data, you can request them using the Cloud Data Delivery Service (CDDS). 

Where and how will I be able to access SARS-CoV-2 normalized data after this change occurs?

To ensure a smooth transition, we want you to have enough time to adjust your scripts and pipelines to minimize disruption to your analyses.

Full-scale access to microbial Pathogen Detection data in the Cloud!

NCBI’s Pathogen Detection resource now provides selected data on the Google Cloud Platform (GCP), giving you better access to data for over 1 million bacterial isolates.

Data on GCP include:

  1. The tables from the MicroBIGG-E database of antimicrobial resistance (AMR), stress response, and virulence genes and genomic elements, and from the Pathogen Isolates Browser. Both are accessible through Google BigQuery.
  2. The MicroBIGG-E sequences in FASTA format that are available from Google Cloud Storage.

Features & Benefits

Pathogen Detection data on GCP gives you larger-scale access than is currently available through the web or FTP. Notably, there is no FTP access to MicroBIGG-E, the web interface is limited to 100K rows, and sequence downloads are restricted. There are no such restrictions on GCP. MicroBIGG-E at BigQuery also allows you to download all AMRFinderPlus results. Currently there are more than 20 million rows of antimicrobial resistance, virulence, and stress response genes, and point mutations, identified in more than 1 million pathogen isolates.
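To give a sense of what a BigQuery request against MicroBIGG-E might look like, here is a minimal sketch that builds a query counting rows per gene symbol. The table ID `ncbi-pathogen-detect.pdbrowser.microbigge` and the column name `element_symbol` are assumptions made for this example and should be checked against the current Pathogen Detection documentation. The sketch only builds the SQL text and does not contact BigQuery.

```python
# Assumed table ID; verify against the Pathogen Detection documentation
MICROBIGGE_TABLE = "ncbi-pathogen-detect.pdbrowser.microbigge"


def top_elements_query(table=MICROBIGGE_TABLE, limit=20):
    """Build SQL that counts MicroBIGG-E rows per element symbol."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        # `element_symbol` is an assumed column name
        "SELECT element_symbol, COUNT(*) AS n_rows\n"
        f"FROM `{table}`\n"
        "GROUP BY element_symbol\n"
        "ORDER BY n_rows DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(top_elements_query())
```

The resulting text can be pasted into the BigQuery console or submitted through a BigQuery client library.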

Here are two examples where researchers have used MicroBIGG-E and AMRFinderPlus data to advance research on antimicrobial resistance:

    • Identifying conserved functional regions in erythromycin resistance methyltransferases (PMID: 34795028).
    • Assessing the health risks of antibiotic resistance genes (PMCID: PMC8346589).

NCBI Workshop at the ASM NGS 2022 Meeting

NCBI Microbial Pathogen and SARS-CoV-2 Resources in the Cloud

Get hands-on experience with NCBI Pathogen Detection and SARS-CoV-2 Surveillance data in the cloud. No prior cloud experience necessary!

NCBI staff are presenting a workshop at the American Society for Microbiology Next-Generation Sequencing (ASM NGS) 2022 Meeting on Sunday, October 16, 2022, from 10 am – 3 pm ET (with a 1-hour break) to help conference attendees learn about two NCBI cloud-hosted resources: Pathogen Detection and SARS-CoV-2 Genome Sequence datasets.

Top 3 reasons to use ElasticBLAST

ElasticBLAST is a new way to BLAST large numbers of queries, faster and on the cloud. Here are the top three reasons you should use ElasticBLAST:

1. ElasticBLAST can handle much LARGER queries! 

ElasticBLAST can search query sets of hundreds to millions of sequences against BLAST databases of all sizes.

2. ElasticBLAST is FASTER

ElasticBLAST distributes your searches across multiple cloud instances to process them simultaneously. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+.
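The general idea of splitting a large query set so that pieces can be searched at the same time can be sketched as follows. This is not ElasticBLAST's actual partitioning scheme, only an illustration: the function name and the `max_letters` threshold are made up for this example.

```python
def batch_queries(sequences, max_letters):
    """Group (id, sequence) pairs into batches of at most max_letters letters.

    A single sequence longer than max_letters is placed in a batch of its own.
    """
    batches, current, size = [], [], 0
    for seq_id, seq in sequences:
        # Start a new batch when adding this sequence would exceed the limit
        if current and size + len(seq) > max_letters:
            batches.append(current)
            current, size = [], 0
        current.append(seq_id)
        size += len(seq)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    queries = [("q1", "ACGT"), ("q2", "ACGTAC"), ("q3", "AC")]
    # Each resulting batch could be handed to a different worker
    print(batch_queries(queries, max_letters=8))
```

Each batch is independent of the others, which is what makes it possible to process them on separate instances.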

3. ElasticBLAST is EASY to run on the cloud

ElasticBLAST is easy to set up using our step-by-step instructions for Amazon Web Services (AWS) and Google Cloud Platform (GCP), and it allows you to leverage the power of the cloud. Once configured, it manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done.

ElasticBLAST also selects the instance (i.e., machine) type for you based on database size. Of course, you can also choose the instance type manually if you prefer.

Introducing ElasticBLAST – BLAST® is now easier, bigger, and faster on the Cloud!

ElasticBLAST is a new tool that helps you run BLAST searches on the cloud. ElasticBLAST is perfect for you if you have thousands to millions of queries to our Basic Local Alignment Search Tool (BLAST®), or if you want to use cloud infrastructure for your searches. ElasticBLAST can handle large searches that are not appropriate for NCBI web BLAST, and it runs them more quickly than stand-alone BLAST+.

ElasticBLAST works on two of the current NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) partners: Amazon Web Services (AWS) and Google Cloud Platform (GCP). ElasticBLAST works by distributing your searches across multiple cloud instances to process them in parallel. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+. ElasticBLAST can handle millions of queries, and it also supports most BLAST+ options and programs.

Making it easier to run BLAST on the cloud

ElasticBLAST reduces the barrier to using the cloud by creating and managing cloud resources for you. It manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done. For example, ElasticBLAST will select the best cloud instance type for your search based on database metadata that describes the database size and memory needs (Figure 1). You can also manually select the instance type if you prefer.

Fig. 1: JSON metadata for the 16S_ribosomal_RNA database. The “bytes-to-cache” information helps ElasticBLAST pick out an instance with the appropriate capacity.
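The way a "bytes-to-cache" value could drive instance selection can be sketched as below. This is an illustration under stated assumptions, not ElasticBLAST's actual logic: the instance catalog, the headroom factor, and the numeric value in the example metadata are all hypothetical. Only the "bytes-to-cache" key name comes from the figure caption.

```python
import json

# Hypothetical catalog of (instance name, memory in GiB); not a real provider list
INSTANCE_CATALOG = [("small", 8), ("medium", 32), ("large", 128)]


def pick_instance(metadata_json, catalog=INSTANCE_CATALOG, headroom=1.2):
    """Return the smallest instance whose memory covers bytes-to-cache.

    `headroom` is an assumed safety factor applied to the database size.
    """
    metadata = json.loads(metadata_json)
    needed_bytes = metadata["bytes-to-cache"] * headroom
    # Try instances from smallest to largest memory
    for name, mem_gib in sorted(catalog, key=lambda item: item[1]):
        if mem_gib * 2**30 >= needed_bytes:
            return name
    raise ValueError("no instance in the catalog has enough memory")


if __name__ == "__main__":
    # Illustrative metadata; the number is made up for this example
    example = '{"dbname": "16S_ribosomal_RNA", "bytes-to-cache": 14000000}'
    print(pick_instance(example))
```

Choosing the smallest instance that fits keeps the database in memory without paying for capacity the search does not need.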

Selecting Databases

ElasticBLAST can access the 28 NCBI databases available on AWS and GCP. These are the same databases that are available from the NCBI FTP site. For instance, databases available on the two cloud providers include the RefSeq Eukaryotic Representative Genomes database, the 16S database based on Targeted Loci, and the human and mouse genome databases.

You can also provide your own databases, and you can produce the metadata needed to select an instance through a Python script that comes with ElasticBLAST.

Example Runs

ElasticBLAST can perform a variety of searches with query sets that range from hundreds to millions of sequences and BLAST databases of all sizes.  Table 1 shows ElasticBLAST searches with query sets that range up to billions of letters using a variety of BLAST databases.

Table 1: Sample ElasticBLAST searches.  This table demonstrates the breadth of searches supported by ElasticBLAST.  Additionally, the first row demonstrates the ability of ElasticBLAST to use many CPUs (3200) on a cloud provider at once to complete a task in hours that would have taken days on a single machine.

Costs

Because ElasticBLAST runs on cloud providers, using it will incur some cost. Based on current cost structures on AWS and GCP, in most cases these costs are quite small. For example, a protein search with a query of about 20 million residues against a database of about 20 billion residues can cost less than $5. Even a larger search with a query of 3-4 billion DNA bases can cost only around $50. Both cloud services include the option to bid on instances for less than full price, which can result in significant savings. ElasticBLAST can be configured to request such instances. Your costs will obviously vary based on many factors, and we encourage you to explore these options with the individual cloud providers. Also, both AWS and GCP offer a free tier or time-limited trial of their cloud services, and you can find information about using ElasticBLAST with the free tiers here.

Welcome to ElasticBLAST!

Go ahead and run your first ElasticBLAST search! We are sure you’ll love how ElasticBLAST accelerates your research.

Your feedback is crucial to the development and support of ElasticBLAST. If you have any questions or suggestions, please reach out to us at [email protected]. We’d love to hear from you.

ElasticBLAST is a cloud-native package developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) with support from the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

Feb 16 Webinar: accelerating your alignments in the Cloud with ElasticBLAST

Join us on February 16, 2022 at 12 PM US eastern time to learn about ElasticBLAST, a new tool that runs your BLAST searches on cloud hardware, using the standard BLAST command-line package. You will hear about the benefits of ElasticBLAST, which include speed and ease of use. You will also see some practical applications of this tool and how you can try it out yourself.

    • Date: Wed, February 16, 2022
    • Time: 12:00 PM – 12:45 PM EST
    • Register

After registering, you will receive a confirmation email with information about attending the webinar. A few days after the live presentation, you can view the recording on the NCBI webinars playlist on the NLM YouTube channel. You can learn about future webinars on the Outreach Events page.

NIH’s Cloud Data Delivery Service: SRA Delivers Even More Big Data to your Cloud Bucket

The Sequence Read Archive (SRA) is the National Institutes of Health’s (NIH) primary repository for raw, high-throughput sequencing data, containing both controlled- and open-access datasets that continue to grow exponentially. SRA is managed by the National Library of Medicine’s National Center for Biotechnology Information (NCBI), and the data are available from NCBI’s servers as well as through two cloud platforms: Amazon Web Services (AWS) and Google Cloud Platform (GCP). Cloud access was made possible by support from NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

Due to SRA’s exponential growth in size, data in the cloud environment are currently partitioned into hot and cold storage to keep SRA sustainable and accessible. Per industry standards, data in hot storage are immediately accessible; because hot storage is more expensive to host, we try to reserve it for our most frequently requested datasets. The less frequently requested datasets are available in cold storage, which may not be immediately accessible. Fret not! SRA is constantly evolving to meet our users’ needs. NCBI’s Cloud Data Delivery Service (CDDS) now allows you to get public and controlled-access data delivered from cold and hot storage directly to your chosen cloud bucket in just a few hours. The minor cost is currently covered by NCBI, but certain limits apply: within a 30-day request cycle, users can request up to 5 TB from cold storage and 20 TB from hot storage for delivery to their cloud bucket.
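When planning a large delivery, the per-cycle limits above amount to simple bookkeeping, sketched here. The function names and the idea of tracking requests as a list are made up for this example, and CDDS itself enforces the limits. Only the 5 TB and 20 TB figures come from the text.

```python
# Limits per 30-day request cycle, as stated in the text (in TB)
LIMITS_TB = {"cold": 5.0, "hot": 20.0}


def remaining_quota_tb(requests):
    """Return the TB still available per tier after earlier requests.

    `requests` is a list of (tier, size_tb) pairs made in the current cycle.
    """
    remaining = dict(LIMITS_TB)
    for tier, size_tb in requests:
        remaining[tier] -= size_tb
    return remaining


def can_request(requests, tier, size_tb):
    """Check whether a new request fits within the tier's remaining quota."""
    return size_tb <= remaining_quota_tb(requests)[tier]


if __name__ == "__main__":
    earlier = [("cold", 3.0), ("hot", 12.0)]
    print(remaining_quota_tb(earlier))
    print(can_request(earlier, "cold", 2.5))
```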

Learn the best way to find data in NIH’s Sequence Read Archive (SRA) on the cloud

NCBI will present a workshop at the American Society of Human Genetics (ASHG) as part of their conference activities in 2021. The workshop is scheduled for Wednesday, September 15, 2021.

Register now!

Adelaide Rhodes, Ph.D., from the Customer Experience team and Adam Stine, Ph.D., SRA Curator, will co-lead the workshop, which will introduce attendees to powerful metadata searches on BigQuery on Google Cloud Platform (GCP) and Athena on Amazon Web Services (AWS) to speed up analytic workflows using the NIH’s Sequence Read Archive (SRA).

Cloud-based query services with expanded metadata options for SRA help researchers to find the target data more quickly than ever before. The workshop will be a mix of training in Structured Query Language (SQL), demos on the cloud console and hands-on exercises in Jupyter notebooks with examples to help researchers understand how to build searches in SQL. Researchers who attend this workshop will learn how to extract specific data sets as well as how to conduct exploratory analysis of the entirety of the SRA data available in the cloud.

Both BigQuery and Athena use SQL, but no prior SQL experience is required. By the end of this workshop you will know how to run cloud metadata queries using SQL to find SRA data based on parameters that are of interest to you.
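As a taste of what such a metadata query can look like, here is a minimal sketch that builds a BigQuery-style SQL statement with named parameters. The table ID `nih-sra-datastore.sra.metadata` and the column names `acc`, `organism`, `assay_type`, and `platform` are assumptions made for this example and should be checked against the SRA cloud documentation. The sketch only builds the SQL text and does not contact BigQuery or Athena.

```python
# Assumed BigQuery table ID for SRA metadata; verify before use
SRA_METADATA_TABLE = "nih-sra-datastore.sra.metadata"


def build_sra_query(table=SRA_METADATA_TABLE, limit=100):
    """Build SQL that finds run accessions by organism and assay type.

    The values for @organism and @assay_type are supplied separately
    as query parameters rather than pasted into the SQL text.
    """
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        # Column names below are assumed for illustration
        "SELECT acc, organism, assay_type, platform\n"
        f"FROM `{table}`\n"
        "WHERE organism = @organism AND assay_type = @assay_type\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    params = {"organism": "Homo sapiens", "assay_type": "RNA-Seq"}
    print(build_sra_query(limit=10))
    print(params)
```

Keeping the search values in parameters instead of the SQL string makes the same query reusable for other organisms or assay types.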

Adam Stine, Ph.D., SRA Curator
Adelaide Rhodes, Ph.D., Customer Experience