NCBI staff will be presenting virtual posters at the Cold Spring Harbor Laboratory Biology of Genomes Meeting, May 11-14, 2021. The posters will cover the following topics: 1) a cloud-ready suite of tools (PGAP, RAPT, and SKESA) for assembling and annotating prokaryotic genomes, 2) Datasets, a new set of services for downloading genome assemblies and annotations, and 3) updates on NCBI RefSeq eukaryotic genome annotation and the Genome Data Viewer (GDV). Read on for the full abstracts.
The virtual poster gallery opens Tuesday, May 11 at 9:00 a.m., with dedicated time for poster viewing and discussion through Slack from 1:00 to 2:00 p.m. each day. The poster gallery will be open for the entire conference and remain available for six weeks afterwards.
The NCBI tool suite for prokaryotic genomes: how RAPT, SKESA and PGAP can accelerate your research
Thibaud-Nissen F, Agarwala R, Arndt D, Hlavina W, Li W, Lu S, Meric P, Souvorov A, Sweeney D, Wagner L, Yang M
NCBI has developed a suite of publicly available tools for assembling, annotating and verifying the species assignment of bacterial and archaeal genomes. RAPT brings together SKESA, an efficient de Bruijn graph assembler for Illumina short reads, and PGAP, the pipeline used for the annotation of RefSeq prokaryotic genomes. Recent workflow changes have reduced the PGAP and RAPT runtime by half, so that a user can now assemble a genome from sequencing reads and annotate the structure and function of genes on the resulting assembly in minutes to a couple of hours, using a single command.
Docker images for PGAP and RAPT are available on Docker Hub and can run on a local computer, on a private cluster, or in a cloud environment, using intuitive command-line interfaces. The images contain the PGAP CWL workflow, all necessary binaries (including SKESA in the case of RAPT) and cwltool, the CWL reference implementation. All necessary reference data, including a variety of manually curated evidence, are bundled and distributed with PGAP and RAPT.
A special implementation of RAPT for users of the Google Cloud Platform that makes use of the Google Life Sciences API is also available. With a single command from the Google Cloud Shell, GCP RAPT secures a virtual machine, downloads the Docker image and data needed, assembles, verifies the taxonomic assignment, annotates the genome, places the output in the desired bucket and shuts down the virtual machine.
Finally, we will present a pilot web service for RAPT, aimed at helping biologists who lack the technical skills or access to compute resources answer their scientific questions, and at understanding their needs for prokaryotic genomics tools and data.
NCBI Datasets: Get the genome-related data you want, the way you want
VA Schneider, E Cox, PA Meric, JB Holmes, and NA O’Leary
For researchers performing genomic analyses, NCBI is recognized as one of the pre-eminent public archival collections from which sequences, assemblies, annotations, and metadata for organisms across the tree of life can be freely retrieved. As the volume and complexity of data grow, it becomes increasingly important to provide access mechanisms that allow researchers to find the data they need efficiently and effectively. Furthermore, researchers need infrastructure and data that adhere to FAIR principles (Findable, Accessible, Interoperable, Reusable) to ensure the usability of the data and the quality of their analyses. NCBI Datasets is a new resource focused on these needs, developed specifically to make it easy for researchers to get the data they want, so they can use it. We will show how Datasets offers web-based, command-line, and API access to genome and gene-related sequence content and metadata from all branches of the taxonomic tree. We will review the structure of genome datasets, which include genome, transcript, and protein sequence, annotation, and a JSON-lines formatted data report of genome metadata. We will also introduce the dataformat tool, which transforms JSON-lines into a tabular report. We will present other available NCBI Datasets packages, including packages for gene and ortholog data and, for those studying SARS-CoV-2, a package that includes genomic, protein, and CDS sequences, annotation, and a comprehensive data report for all complete SARS-CoV-2 genomes. Finally, we’ll present the Datasets Python and R libraries that allow researchers to access the APIs, facilitating their use in analysis workflows, and the companion Jupyter and R notebooks that are provided to help researchers get started with these tools. Because Datasets is under active development, we’ll also share the latest improvements and features.
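The idea behind turning a JSON-lines data report into a tabular report can be illustrated with a minimal Python sketch. This is not the dataformat tool or its implementation; the function names (`jsonl_to_tsv`, `lookup`), the field names (`accession`, `organism.name`), and the sample records are all hypothetical, chosen only to show how one JSON object per line maps to one table row.

```python
import csv
import io
import json


def lookup(record, dotted_path):
    """Follow a dotted path such as 'organism.name' into nested dicts.

    Returns an empty string when any key along the path is missing.
    """
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return ""
        value = value[key]
    return value


def jsonl_to_tsv(jsonl_text, fields):
    """Convert JSON-lines text to a tab-separated table with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        writer.writerow([lookup(record, field) for field in fields])
    return out.getvalue()


# Hypothetical records; real data reports have their own schema.
sample = (
    '{"accession": "EXAMPLE_0001", "organism": {"name": "Organism A"}}\n'
    '{"accession": "EXAMPLE_0002"}\n'
)
print(jsonl_to_tsv(sample, ["accession", "organism.name"]), end="")
```

Missing fields become empty cells rather than errors, which keeps every row the same width when records differ in which metadata they carry.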
Annotating genomes at NCBI RefSeq in the era of 3rd generation sequencing
Terence D Murphy, Françoise Thibaud-Nissen
Advances in sequencing technology over the last decade have led to a cornucopia of genome assemblies for multicellular eukaryotes. Many species have new, high-quality assemblies based on PacBio, Oxford Nanopore (ONT), or other technologies along with abundant RNA-seq datasets, generated by many researchers from around the world. To help maximize the utility of these genomes for the research community, NCBI’s Reference Sequence (RefSeq) project provides genome annotations for over 700 species spanning over 350 vertebrates, 200 invertebrates, and 100 plants. NCBI’s automated annotation pipeline provides rapid, high-quality gene annotations across many taxa, with consistent processing that benefits comparative genomic studies. Annotation sets typically exceed 97% completeness as measured by BUSCOv4, surpassing most other datasets. Annotations are available in NCBI’s Gene resource, BLAST databases, and Genome Data Viewer (GDV). Gene and GDV also provide access to other genomic information including orthologs, RNA-seq expression data, and whole genome alignments to previous assembly versions or assemblies from different strains. This presentation will explore the lessons we’ve learned from annotating a diverse collection of genomes, including the impacts of RNA-seq and assembly quality, demonstrate the high quality of the annotated gene sets, and give an overview of NCBI’s resources. The eukaryotic genome annotation and Genome Data Viewer pages provide more information.