Removing contaminated sequences using NCBI quality assurance tools
Do you use BLAST to identify a sequence or the evolutionary scope of a gene? That can be challenging if contaminated and misclassified sequences are in the BLAST databases and show up in your search results. To address this problem, we now use the NCBI quality assurance tools listed below to systematically remove these misleading sequences from the default nucleotide (nt) and protein (nr) BLAST databases.
- Foreign Contamination Screen tool for genome cross-species screening (FCS-GX) detects contamination from foreign organisms in genomes and other sequences using the genome cross-species aligner (GX).
- Average Nucleotide Identity (ANI) evaluates the taxonomic classification of prokaryotic genome assemblies. Sequences from genomes marked as ‘unverified source organism’ are considered suspect and removed.
This process has removed approximately 2.23% of sequences from nr and 0.01% from nt. Lists of nucleotide and protein sequences identified as contaminant or misclassified are available from our FTP site.
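If you keep your own local database or older search results, you could use these lists to screen hits yourself. The Python sketch below is only an illustration under stated assumptions: it assumes the list is a plain-text file with one accession per line (the actual file layout on the FTP site may differ), and that your BLAST results are in tabular format (`-outfmt 6`), where the second column is the subject sequence ID. The function names and the accessions in the example are hypothetical.

```python
def load_accessions(lines):
    """Collect accessions from an iterable of text lines.

    Assumes one accession per line (first whitespace-separated token);
    blank lines and lines starting with '#' are skipped.
    """
    accessions = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.add(line.split()[0])
    return accessions


def filter_hits(blast_lines, flagged):
    """Return tabular BLAST hit lines whose subject ID is not flagged.

    Assumes tab-separated output (-outfmt 6), where column 2 is the
    subject sequence ID. Matching is exact, so the IDs in both files
    must use the same form (for example, both with version suffixes).
    """
    kept = []
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a well-formed hit line
        if fields[1] not in flagged:
            kept.append(line)
    return kept


# Hypothetical accessions, for illustration only.
flagged = load_accessions(["# flagged sequences", "XX000001.1", "", "XX000002.1"])
hits = [
    "query1\tXX000001.1\t99.0",
    "query1\tYY000003.1\t95.0",
]
clean_hits = filter_hits(hits, flagged)
```

In practice you would pass open file handles instead of the in-memory lists used here, for example `load_accessions(open(path))` with a path to a list you have downloaded.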
Stay up to date
BLAST is part of the NIH Comparative Genomics Resource (CGR). CGR facilitates reliable comparative genomics analyses for all eukaryotic organisms through an NCBI Toolkit and community collaboration.
Follow us on social media @NCBI and join our mailing list to keep up to date with BLAST and other CGR news.
Questions?
We want to hear from you! Try it out and let us know what you think. We are making ongoing improvements based on your feedback. If you have questions or would like to provide feedback, please reach out to us at info@ncbi.nlm.nih.gov.