Metagenomic tools such as Kraken2, Centrifuge, and KMCP accept taxonomy data in the form of NCBI taxdump files.
For viruses, however, ICTV (International Committee on Taxonomy of Viruses) maintains its own taxonomy.
The TaxonKit command taxonkit create-taxdump was developed
to create NCBI-style taxdump files for any taxonomy dataset,
including GTDB and ICTV.
Related projects:
- gtdb-taxdump: GTDB taxonomy taxdump files with trackable TaxIds
- taxid-changelog: NCBI taxonomic identifier (taxid) changelog
- taxonkit: A Practical and Efficient NCBI Taxonomy Toolkit
The release page contains taxdump files.
The Virus Metadata Resource (VMR) provides the taxonomy data of each release. Most viruses are assigned all seven ranks (Kingdom, Phylum, Class, Order, Family, Genus, Species), while some have only a subset of them, such as Genus and Species.
To generate a TaxId, we hash the rank+taxon_name (in lower case) of each taxon node to uint64
using xxhash and convert it to int32.
See more details.
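The idea of a name-derived TaxId can be illustrated with a minimal shell sketch. It is not TaxonKit's implementation: CRC32 from cksum stands in for xxhash, the key is assumed to be the plain concatenation of rank and name, and masking to the low 31 bits is only one assumed way of fitting a hash into the non-negative int32 range. The function name taxid_sketch is hypothetical, and the numbers it prints will not match real TaxIds.

```shell
#!/bin/sh
# Illustrative only: CRC32 (cksum) stands in for xxhash, and the mask is an
# assumed way of reducing the hash to the non-negative int32 range.
taxid_sketch() {
    rank=$1
    name=$2
    # Lowercase the rank+name key so the result does not depend on letter case.
    key=$(printf '%s%s' "$rank" "$name" | tr '[:upper:]' '[:lower:]')
    # Hash the key; cksum prints "<crc> <byte count>", keep the first field.
    hash=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
    # Keep the low 31 bits, i.e. a value in 0..2147483647.
    echo $(( hash & 2147483647 ))
}

taxid_sketch species "Betacoronavirus hongkongense"
taxid_sketch species "BETACORONAVIRUS HONGKONGENSE"   # same value as the line above
```

Because the identifier depends only on the rank and the name, a taxon keeps the same TaxId across releases as long as neither changes.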
The taxonomy data is released as a .xlsx file at https://talk.ictvonline.org/taxonomy/vmr/.
TaxonKit v0.12.0 or later is required;
v0.14.0 or later is preferred.
Since v0.14.0, taxonkit create-taxdump stores
TaxIds as int32, following BLAST and DIAMOND, rather than as uint32 as in earlier versions.
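A quick way to check the installed version before running the pipeline is sketched below. It assumes GNU sort (for the -V option) and assumes that "taxonkit version" prints a line containing a version string such as v0.14.0; the helper name version_ge is hypothetical.

```shell
#!/bin/sh
# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="0.14.0"
if command -v taxonkit >/dev/null 2>&1; then
    # Assumption: the output contains a version string like "v0.14.0".
    current=$(taxonkit version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
    if version_ge "$current" "$required"; then
        echo "taxonkit $current: OK"
    else
        echo "taxonkit $current is older than $required" >&2
    fi
else
    echo "taxonkit not found in PATH" >&2
fi
```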
csvtk is used for data processing.
# download here: https://ictv.global/msl/current
file="ICTV_Master_Species_List_2024_MSL40.v1.xlsx"
sheet="MSL"
# convert xlsx to tsv
csvtk xlsx2csv "$file" -n $sheet \
| csvtk csv2tab \
> ictv.tsv
# replace non-breaking spaces (UTF-8 bytes C2 A0, shown as "M-BM- " by cat -v) with regular spaces.
# https://askubuntu.com/questions/357248/how-to-remove-special-m-bm-character-with-sed
sed -i 's/\xc2\xa0/ /g' ictv.tsv
# not detected in MSL39
#
# remove leading and trailing blanks. e.g., "Escherichia phage PhaxI\t"
# remove a newline character and a space introduced by accident
csvtk replace -t -F -f "*" -p "^\s+|\s+$" ictv.tsv \
| csvtk replace -t -F -f "*" -p "\n " -r "" \
> ictv.clean.tsv
# choose columns, rename, and remove duplicates
csvtk cut -t ictv.clean.tsv -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
| csvtk rename -t -f 1- -n "realm,subrealm,kingdom,subkingdom,phylum,subphylum,class,subclass,order,suborder,family,subfamily,genus,subgenus,species" \
| csvtk uniq -t -f 1- \
> ictv.taxonomy.tsv
# ------------------- create-taxdump -----------------------
taxonkit create-taxdump ictv.taxonomy.tsv --out-dir ictv-taxdump/
# set the environment variable for taxonkit,
# so we don't need to specify "--data-dir ictv-taxdump" for each taxonkit command.
export TAXONKIT_DB=ictv-taxdump
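To see what the cleanup steps in the script above do, the following sketch applies the same two ideas to a tiny made-up sample: replacing non-breaking spaces and trimming blanks around each field. It uses only sed and awk (awk is a stand-in for the csvtk replace call), and the two-column sample file is hypothetical.

```shell
#!/bin/sh
# Made-up sample: the name contains a non-breaking space (UTF-8 bytes C2 A0)
# and ends with a trailing blank.
tmp=$(mktemp -d)
nbsp=$(printf '\302\240')
printf 'id\tname\n1\tEscherichia%sphage PhaxI \n' "$nbsp" > "$tmp/sample.tsv"

# Step 1: replace each non-breaking space with a regular space.
sed "s/$nbsp/ /g" "$tmp/sample.tsv" > "$tmp/sample.nbsp.tsv"

# Step 2: trim leading and trailing blanks in every tab-separated field.
awk 'BEGIN { FS = OFS = "\t" }
     { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i); print }' \
    "$tmp/sample.nbsp.tsv" > "$tmp/sample.clean.tsv"

# Count lines that still contain a non-breaking space; 0 means clean.
remaining=$(grep -c "$nbsp" "$tmp/sample.clean.tsv" || true)
echo "lines with non-breaking spaces: $remaining"
```

The same grep count can be run on ictv.clean.tsv to confirm that no non-breaking spaces survived the real pipeline.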
See more TaxonKit commands and usage examples.
- Count of all ranks (version: MSL40)

    $ taxonkit list --ids 1 \
        | taxonkit lineage -L -r \
        | csvtk freq -H -t -f 2 -n \
        | csvtk pretty -H -t
    no rank     1
    subphylum   4
    realm       7
    kingdom     11
    suborder    12
    phylum      22
    class       49
    subgenus    86
    order       93
    subfamily   213
    family      368
    genus       3769
    species     16215

- The TaxId

    $ echo 'Betacoronavirus hongkongense' | taxonkit name2taxid
    Betacoronavirus hongkongense    418966335

- Complete lineage

    $ echo 418966335 | taxonkit lineage
    418966335    Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Embecovirus;Betacoronavirus hongkongense

    # another format
    $ echo 418966335 \
        | taxonkit lineage -t \
        | csvtk cut -Ht -f 3 \
        | csvtk unfold -Ht -f 1 -s ";" \
        | taxonkit lineage -r -n -L \
        | csvtk cut -Ht -f 1,3,2 \
        | csvtk pretty -Ht
    1864891977   realm       Riboviria
    1844659726   kingdom     Orthornavirae
    38781089     phylum      Pisuviricota
    1832208221   class       Pisoniviricetes
    1393610206   order       Nidovirales
    218352182    suborder    Cornidovirineae
    779314330    family      Coronaviridae
    146452600    subfamily   Orthocoronavirinae
    68549826     genus       Betacoronavirus
    692402414    subgenus    Embecovirus
    418966335    species     Betacoronavirus hongkongense

    # in NCBI taxonomy
    $ echo 'Betacoronavirus hongkongense' \
        | taxonkit name2taxid --data-dir ~/.taxonkit \
        | csvtk cut -Ht -f 2 \
        | taxonkit lineage -t --data-dir ~/.taxonkit \
        | csvtk cut -Ht -f 3 \
        | csvtk unfold -Ht -f 1 -s ";" \
        | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \
        | csvtk cut -Ht -f 1,3,2 \
        | csvtk pretty -Ht
    10239     acellular root   Viruses
    2559587   realm            Riboviria
    2732396   kingdom          Orthornavirae
    2732408   phylum           Pisuviricota
    2732506   class            Pisoniviricetes
    76804     order            Nidovirales
    2499399   suborder         Cornidovirineae
    11118     family           Coronaviridae
    2501931   subfamily        Orthocoronavirinae
    694002    genus            Betacoronavirus

- Reformat the lineage.

    # realm,kingdom,phylum,class,order,family,genus,species
    $ echo 418966335 \
        | taxonkit reformat2 -I 1 -f "{domain|acellular root|superkingdom|realm};{kingdom};{phylum};{class};{order};{family};{genus};{species}"
    418966335    Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Betacoronavirus hongkongense

- Rhizidiomyces virus.

    $ echo 'Rhizidiomyces virus' | taxonkit name2taxid
    Rhizidiomyces virus    1826348565

    $ echo 1826348565 | taxonkit lineage
    1826348565    Rhizidiovirus;Rhizidiomyces virus

    $ echo 1826348565 \
        | taxonkit lineage -t \
        | csvtk cut -Ht -f 3 \
        | csvtk unfold -Ht -f 1 -s ";" \
        | taxonkit lineage -r -n -L \
        | csvtk cut -Ht -f 1,3,2 \
        | csvtk pretty -Ht
    156661886    genus     Rhizidiovirus
    1826348565   species   Rhizidiomyces virus

- Deltasatellite solaniflavussecundi.

    $ echo 'Deltasatellite solaniflavussecundi' | taxonkit name2taxid
    Deltasatellite solaniflavussecundi    574171172

    $ echo 574171172 | taxonkit lineage
    574171172    Tolecusatellitidae;Deltasatellite;Deltasatellite solaniflavussecundi

    $ echo 574171172 \
        | taxonkit lineage -t \
        | csvtk cut -Ht -f 3 \
        | csvtk unfold -Ht -f 1 -s ";" \
        | taxonkit lineage -r -n -L \
        | csvtk cut -Ht -f 1,3,2 \
        | csvtk pretty -Ht
    194525965   family    Tolecusatellitidae
    289873738   genus     Deltasatellite
    574171172   species   Deltasatellite solaniflavussecundi

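The rank counts listed above can also be cross-checked without TaxonKit by reading nodes.dmp directly. The sketch below assumes the standard NCBI taxdump layout, in which fields are separated by a tab, a vertical bar, and a tab, and the third field holds the rank; the function name count_ranks is hypothetical, and the default directory name is taken from the create-taxdump step.

```shell
#!/bin/sh
# count_ranks FILE: print "rank<TAB>count" for a nodes.dmp-style file,
# sorted by count and then by rank name.
count_ranks() {
    awk -F '\t[|]\t' '
        {
            rank = $3
            sub(/\t[|]$/, "", rank)   # drop a trailing "\t|" if the rank is the last field
            n[rank]++
        }
        END { for (r in n) printf "%s\t%d\n", r, n[r] }' "$1" \
        | sort -t "$(printf '\t')" -k 2,2n -k 1,1
}

# Assumption: TAXONKIT_DB points at the directory written by create-taxdump.
nodes="${TAXONKIT_DB:-ictv-taxdump}/nodes.dmp"
if [ -f "$nodes" ]; then
    count_ranks "$nodes"
fi
```

If the totals differ from the csvtk freq output, the taxdump directory in use is probably not the one that was just created.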
Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006
We welcome pull requests, bug fixes and issue reports.