Do you currently access genome assembly data through the FTP site? We are consolidating the information provided in the ASSEMBLY_REPORTS and GENOME_REPORTS directories on the genomes FTP site to simplify access and ensure that you have the most accurate, up-to-date, and consistently reported data.
The assembly_summary files in the ASSEMBLY_REPORTS directory are gaining newly added columns 24-38. These columns report statistics about the assembly (genome size, GC content, and number of sequences) as well as details about the provided annotation (annotation name and date, and number of genes). See the example below (Table 1). Check out the README for more details about the contents of the summary files.
Column | Header | Entry |
24 | assembly_type | haploid-with-alt-loci |
25 | group | vertebrate_mammalian |
26 | genome_size | 3099441038 |
27 | genome_size_ungapped | 2948318359 |
28 | gc_percent | 40.5 |
29 | replicon_count | 24 |
30 | scaffold_count | 470 |
31 | contig_count | 35611 |
32 | annotation_provider | NCBI RefSeq |
33 | annotation_name | GCF_000001405.40-RS_2023_03 |
34 | annotation_date | 03/15/23 |
35 | total_gene_count | 59444 |
36 | protein_coding_gene_count | 20080 |
37 | non_coding_gene_count | 21954 |
38 | pubmed_id | 11237011;15496913;… |
Table 1. An example of new information added to the assembly_summary_refseq.txt file for the human assembly GCF_000001405.40
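If you parse the summary files with a script, the new columns can be picked out by position. The snippet below is a minimal sketch rather than an official parser: it assumes the file is tab-delimited and that comment and header lines start with `#`, and it takes the column names and sample values from Table 1. The `read_new_columns` function and the placeholder values used for columns 1-23 are hypothetical and exist only for illustration.

```python
import csv
import io

# Header names for columns 24-38, copied from Table 1.
NEW_COLUMNS = [
    "assembly_type", "group", "genome_size", "genome_size_ungapped",
    "gc_percent", "replicon_count", "scaffold_count", "contig_count",
    "annotation_provider", "annotation_name", "annotation_date",
    "total_gene_count", "protein_coding_gene_count",
    "non_coding_gene_count", "pubmed_id",
]


def read_new_columns(handle, first_new_column=24):
    """Yield one dict per data row, holding only columns 24 onward.

    Assumption: the input is tab-delimited and lines starting with '#'
    are comments or headers, not data.
    """
    start = first_new_column - 1  # 1-based column number -> 0-based index
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(NEW_COLUMNS, row[start:]))


# Illustrative input: 23 placeholder fields followed by the Table 1 entries.
TABLE_1_ENTRIES = [
    "haploid-with-alt-loci", "vertebrate_mammalian", "3099441038",
    "2948318359", "40.5", "24", "470", "35611", "NCBI RefSeq",
    "GCF_000001405.40-RS_2023_03", "03/15/23", "59444", "20080",
    "21954", "11237011;15496913;…",
]
SAMPLE = "# comment line\n" + "\t".join(["placeholder"] * 23 + TABLE_1_ENTRIES) + "\n"

rows = list(read_new_columns(io.StringIO(SAMPLE)))
human = rows[0]

# All values arrive as strings, so convert before doing arithmetic.
gap_bases = int(human["genome_size"]) - int(human["genome_size_ungapped"])
print(human["annotation_name"], gap_bases)
```

To use this on a real file, pass an open file handle instead of the `io.StringIO` sample. If the README documents a header line with column names, reading columns by name would be more robust than relying on fixed positions.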
We previously reported this information in separate files under the GENOME_REPORTS directory (prokaryotes.txt, eukaryotes.txt, viruses.txt). Those files were generated by an older process that was less accurate and less comprehensive than the one behind the updated assembly_summary files. The old files will be removed in September 2023.
Did you know? You can also access most of this data through NCBI Datasets, an alternative to FTP downloads. Check out the NCBI Datasets genome assembly report, a JSON file included in all genome package downloads that can be accessed by web, command line, or API. The report consolidates the data from the assembly_summary file described here, along with valuable assembly-related metadata from other NCBI databases, including Taxonomy, BioProject, and BioSample.
Stay up to date
Follow us on Twitter @NCBI and join our mailing list to keep up to date with RefSeq and other NCBI news.
Questions?
If you have questions or would like to provide feedback, please reach out to us at [email protected].