shenwei356/ictv-taxdump

NCBI-style taxdump files for International Committee on Taxonomy of Viruses (ICTV)

Metagenomic tools such as Kraken2, Centrifuge, and KMCP accept taxonomy data in the form of NCBI taxdump files.

For viruses, however, ICTV (the International Committee on Taxonomy of Viruses) maintains its own taxonomy data.

The TaxonKit command taxonkit create-taxdump creates NCBI-style taxdump files from any taxonomy dataset, including GTDB and ICTV.

Related projects:

  • gtdb-taxdump: GTDB taxonomy taxdump files with trackable TaxIds
  • taxid-changelog: NCBI taxonomic identifier (taxid) changelog
  • taxonkit: A Practical and Efficient NCBI Taxonomy Toolkit

Download

The release page contains taxdump files.

Methods

Taxonomic hierarchy

The Virus Metadata Resource (VMR) provides the taxonomy data for each release. Most viruses are assigned all seven ranks (Kingdom, Phylum, Class, Order, Family, Genus, Species), while some have only a subset of them, such as Genus and Species.
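To get a feel for how many lineages are partial, the following minimal sketch tallies the number of non-empty rank columns per row. It assumes a tab-separated table with a header row and one column per rank, such as the ictv.taxonomy.tsv produced in the Steps section; the function name count_filled_ranks is hypothetical.

```shell
# count_filled_ranks FILE   (hypothetical helper, not part of taxonkit or csvtk)
# Print "<number of non-empty rank columns><TAB><number of rows>" for a
# tab-separated taxonomy table whose first line is a header.
count_filled_ranks() {
    awk -F'\t' '
        NR == 1 { next }                 # skip the header row
        {
            filled = 0
            for (i = 1; i <= NF; i++)    # count the columns that hold a taxon name
                if ($i != "") filled++
            rows[filled]++
        }
        END { for (k in rows) printf "%d\t%d\n", k, rows[k] }
    ' "$1" | sort -n
}
```

For example, count_filled_ranks ictv.taxonomy.tsv would list, for each number of assigned ranks, how many lineages have exactly that many.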

Generation of TaxIds

We hash the rank+taxon_name (in lower case) of each taxon node to uint64 using xxhash and convert it to int32.

See more details.
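The idea can be illustrated with a toy shell function. This is only a sketch under loud assumptions: POSIX cksum (CRC-32) stands in for xxhash, and both the key layout (rank, a space, the lower-cased name) and the 31-bit mask are guesses made for illustration. The numbers it prints will therefore not match the TaxIds produced by taxonkit create-taxdump. What it does show is that an id derived this way depends only on the rank and the name, ignores letter case, and stays within the positive int32 range.

```shell
# toy_taxid RANK NAME   (hypothetical helper; NOT the algorithm used by taxonkit)
# Derive a deterministic id from rank + name.
toy_taxid() {
    # Assumed key layout: "<rank> <name>", lower-cased.
    toy_key=$(printf '%s %s' "$1" "$2" | tr '[:upper:]' '[:lower:]')
    # Stand-in hash: CRC-32 from POSIX cksum (the real tool uses xxhash).
    toy_crc=$(printf '%s' "$toy_key" | cksum | cut -d' ' -f1)
    # Assumed reduction: keep the low 31 bits so the value fits a positive int32.
    echo $(( toy_crc & 0x7FFFFFFF ))
}
```

Calling toy_taxid genus Betacoronavirus twice yields the same number each time, while toy_taxid species Betacoronavirus yields a different one, because the rank is part of the key.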

Data and tools

The taxonomy data is released as a .xlsx file at https://talk.ictvonline.org/taxonomy/vmr/.

TaxonKit v0.12.0 or later is required; v0.14.0 or later is preferred. Since v0.14.0, taxonkit create-taxdump stores TaxIds as int32, following BLAST and DIAMOND, rather than as uint32 as in earlier versions.
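If a taxdump was built with an older TaxonKit version, a quick way to see whether its TaxIds still fit into int32 is to scan the first column of nodes.dmp. The sketch below assumes the standard NCBI nodes.dmp layout (fields separated by a tab, a pipe, and a tab, with the TaxId in the first field); the function name fits_int32 is hypothetical.

```shell
# fits_int32 NODES_DMP   (hypothetical helper)
# Succeed (exit status 0) only if every TaxId in the first column is at most
# 2147483647, the largest value an int32 can hold.
fits_int32() {
    awk -F'\t[|]\t' '
        $1 + 0 > 2147483647 { bad = 1 }   # a TaxId outside the int32 range
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Usage: fits_int32 ictv-taxdump/nodes.dmp && echo "all TaxIds fit in int32".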

csvtk is used for data processing.

Steps

# download here: https://ictv.global/msl/current
file="ICTV_Master_Species_List_2024_MSL40.v1.xlsx"
sheet="MSL"

# convert xlsx to tsv
csvtk xlsx2csv "$file" -n "$sheet" \
    | csvtk csv2tab \
    > ictv.tsv

# remove M-BM- characters.
# https://askubuntu.com/questions/357248/how-to-remove-special-m-bm-character-with-sed
sed -i 's/\xc2\xa0/ /g' ictv.tsv

# not detected in MSL39
#
# remove leading and trailing blanks. e.g., "Escherichia phage PhaxI\t"
# remove a newline character and a space introduced by accident
csvtk replace -t -F -f "*" -p "^\s+|\s+$" ictv.tsv \
    | csvtk replace -t -F -f "*" -p "\n " -r "" \
    > ictv.clean.tsv

# choose columns, rename, and remove duplicates
csvtk cut -t ictv.clean.tsv -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
    | csvtk rename -t -f 1- -n "realm,subrealm,kingdom,subkingdom,phylum,subphylum,class,subclass,order,suborder,family,subfamily,genus,subgenus,species" \
    | csvtk uniq   -t -f 1- \
    > ictv.taxonomy.tsv
    
# ------------------- create-taxdump -----------------------

taxonkit create-taxdump ictv.taxonomy.tsv --out-dir ictv-taxdump/

# set the environment variable for taxonkit,
# so we don't need to specify "--data-dir ictv-taxdump" for each taxonkit command.
export TAXONKIT_DB=ictv-taxdump
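Because stray blanks and non-breaking spaces in the spreadsheet would produce distinct taxon names, it can be worth confirming that the cleaning steps above left nothing behind. The following is a minimal sketch using only awk; the function name count_untrimmed_fields is hypothetical, and it checks only for ordinary spaces at the field edges and for the UTF-8 non-breaking space (bytes 0xC2 0xA0).

```shell
# count_untrimmed_fields FILE   (hypothetical helper)
# Print how many fields of a tab-separated file still start or end with a
# space, or contain a non-breaking space (bytes 0xC2 0xA0).
count_untrimmed_fields() {
    nbsp=$(printf '\302\240')            # the two bytes of a UTF-8 NBSP
    LC_ALL=C awk -F'\t' -v nbsp="$nbsp" '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ / || $i ~ / $/ || index($i, nbsp) > 0) n++
        }
        END { printf "%d\n", n + 0 }
    ' "$1"
}
```

Running count_untrimmed_fields ictv.clean.tsv should print 0 if the cleaning worked as intended.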

Results

Set the environment variable for taxonkit, so we don't need to specify --data-dir ictv-taxdump for each taxonkit command.

export TAXONKIT_DB=ictv-taxdump

Check more TaxonKit commands and usages

Summary

  1. Count of all ranks (version: MSL40)

     $ taxonkit list --ids 1 \
         | taxonkit lineage -L -r \
         | csvtk freq -H -t -f 2 -n \
         | csvtk pretty -H -t
    
     no rank     1    
     subphylum   4    
     realm       7    
     kingdom     11   
     suborder    12   
     phylum      22   
     class       49   
     subgenus    86   
     order       93   
     subfamily   213  
     family      368  
     genus       3769 
     species     16215
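As a cross-check that does not depend on taxonkit or csvtk, the same tally can be sketched directly from nodes.dmp with awk. This assumes the standard NCBI nodes.dmp layout, in which fields are separated by a tab, a pipe, and a tab, and the rank is the third field; the function name rank_counts is hypothetical.

```shell
# rank_counts NODES_DMP   (hypothetical helper)
# Print "<rank><TAB><count>" for every rank in a nodes.dmp file,
# ordered by count and then by rank name.
rank_counts() {
    awk -F'\t[|]\t' '
        {
            rank = $3
            sub(/\t[|]$/, "", rank)      # drop the row terminator if rank is the last field
            n[rank]++
        }
        END { for (r in n) printf "%s\t%d\n", r, n[r] }
    ' "$1" | sort -t "$(printf '\t')" -k2,2n -k1,1
}
```

Usage: rank_counts ictv-taxdump/nodes.dmp.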
    

Retrieving and reformatting lineages

  1. The TaxId

     $ echo 'Betacoronavirus hongkongense' | taxonkit name2taxid
     Betacoronavirus hongkongense    418966335
    
  2. Complete lineage

     $ echo 418966335 | taxonkit lineage
     418966335       Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Embecovirus;Betacoronavirus hongkongense
    
     # another format
     $ echo 418966335 \
         | taxonkit lineage -t \
         | csvtk cut -Ht -f 3 \
         | csvtk unfold -Ht -f 1 -s ";" \
         | taxonkit lineage -r -n -L \
         | csvtk cut -Ht -f 1,3,2 \
         | csvtk pretty -Ht
    
     1864891977   realm       Riboviria                   
     1844659726   kingdom     Orthornavirae               
     38781089     phylum      Pisuviricota                
     1832208221   class       Pisoniviricetes             
     1393610206   order       Nidovirales                 
     218352182    suborder    Cornidovirineae             
     779314330    family      Coronaviridae               
     146452600    subfamily   Orthocoronavirinae          
     68549826     genus       Betacoronavirus             
     692402414    subgenus    Embecovirus                 
     418966335    species     Betacoronavirus hongkongense
     
     # in NCBI taxonomy
     $ echo 'Betacoronavirus hongkongense' \
         | taxonkit name2taxid --data-dir ~/.taxonkit \
         | csvtk cut -Ht -f 2 \
         | taxonkit lineage -t --data-dir ~/.taxonkit \
         | csvtk cut -Ht -f 3 \
         | csvtk unfold -Ht -f 1 -s ";" \
         | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \
         | csvtk cut -Ht -f 1,3,2 \
         | csvtk pretty -Ht
    
     10239     acellular root   Viruses           
     2559587   realm            Riboviria         
     2732396   kingdom          Orthornavirae     
     2732408   phylum           Pisuviricota      
     2732506   class            Pisoniviricetes   
     76804     order            Nidovirales       
     2499399   suborder         Cornidovirineae   
     11118     family           Coronaviridae     
     2501931   subfamily        Orthocoronavirinae
     694002    genus            Betacoronavirus 
    
  3. Reformat the lineage.

     # realm,kingdom,phylum,class,order,family,genus,species
     $ echo 418966335 \
         | taxonkit reformat2 -I 1 -f "{domain|acellular root|superkingdom|realm};{kingdom};{phylum};{class};{order};{family};{genus};{species}"
     418966335       Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Betacoronavirus hongkongense
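To make explicit what keeping only the main ranks means, here is a plain awk sketch that builds the same kind of semicolon-separated string from rows of TaxId, rank, and name. It is not how taxonkit works internally; the function name main_ranks_lineage is hypothetical, the input is assumed to be tab-separated, and ranks that are missing are simply left as empty slots.

```shell
# main_ranks_lineage   (hypothetical helper)
# Read "taxid<TAB>rank<TAB>name" rows on stdin and print the names of the
# eight main ranks, semicolon-separated, in a fixed order.
main_ranks_lineage() {
    awk -F'\t' '
        BEGIN { n = split("realm kingdom phylum class order family genus species", want, " ") }
        { name[$2] = $3 }                # remember the taxon name for each rank
        END {
            for (i = 1; i <= n; i++)     # missing ranks contribute an empty slot
                line = line (i > 1 ? ";" : "") name[want[i]]
            print line
        }
    '
}
```

The three-column pipeline shown in item 2, without its final csvtk pretty step, emits tab-separated rows of this shape and can be piped into main_ranks_lineage. Sub-ranks such as suborder, subfamily, and subgenus are dropped because they are not in the list of wanted ranks.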
    

Taxa with only parts of ranks

  1. Rhizidiomyces virus.

     $ echo 'Rhizidiomyces virus' | taxonkit name2taxid
     Rhizidiomyces virus     1826348565
    
     $ echo 1826348565 | taxonkit lineage
     1826348565      Rhizidiovirus;Rhizidiomyces virus
    
     $ echo 1826348565 \
         | taxonkit lineage -t \
         | csvtk cut -Ht -f 3 \
         | csvtk unfold -Ht -f 1 -s ";" \
         | taxonkit lineage -r -n -L \
         | csvtk cut -Ht -f 1,3,2 \
         | csvtk pretty -Ht
    
     156661886    genus     Rhizidiovirus
     1826348565   species   Rhizidiomyces virus
    
  2. Deltasatellite solaniflavussecundi.

     $ echo 'Deltasatellite solaniflavussecundi' | taxonkit name2taxid
     Deltasatellite solaniflavussecundi      574171172
    
     $ echo 574171172 | taxonkit lineage
     574171172       Tolecusatellitidae;Deltasatellite;Deltasatellite solaniflavussecundi
    
     $ echo 574171172 \
         | taxonkit lineage -t \
         | csvtk cut -Ht -f 3 \
         | csvtk unfold -Ht -f 1 -s ";" \
         | taxonkit lineage -r -n -L \
         | csvtk cut -Ht -f 1,3,2 \
         | csvtk pretty -Ht
    
     194525965   family    Tolecusatellitidae
     289873738   genus     Deltasatellite
     574171172   species   Deltasatellite solaniflavussecundi
    

More manipulations

See https://bioinf.shenwei.me/taxonkit/usage/.

Citation

Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006

Contributing

We welcome pull requests, bug fixes and issue reports.

License

MIT License
