################################################################################
README for the fcs*.txt.gz files found on the NCBI genomes FTP site:
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/

   fcs_summary_refseq.txt.gz
   fcs_summary_genbank.txt.gz
   fcs_details_refseq.txt.gz
   fcs_details_genbank.txt.gz

Last updated: August 19, 2024
################################################################################

BACKGROUND:

NCBI's Foreign Contamination Screen (FCS) is available from GitHub:
https://github.com/ncbi/fcs

The aggregate reports in this directory summarize results from the NCBI
contamination screens run on public GenBank and RefSeq assemblies. In addition,
each assembly directory on FTP has an individual contamination report.

Aggregate files are broken down by:
   1. genbank - GenBank (GCA_ prefix) assemblies
   2. refseq  - RefSeq (GCF_ prefix) assemblies

=========================================================
COLUMN SPECIFICATIONS FOR AGGREGATE CONTAMINATION REPORTS
=========================================================

1. fcs_summary*.txt.gz - summary of assembly details and found contamination

   columns are:
   assembly_accession[ 1 ]: assembly accession.version
   fcs_gx_software_rev[ 2 ]: FCS-GX software version producing the contamination report
   fcs_gx_db_rev[ 3 ]: FCS-GX database version producing the contamination report
   fcs_gx_run_date[ 4 ]: FCS-GX run date
   asserted_division[ 5 ]: FCS-GX declared taxonomic division
   top_inferred_division[ 6 ]: FCS-GX top hit taxonomic division
   fcs_gx_aggregate_coverage[ 7 ]: FCS-GX aggregate genome coverage by database sequences
   total_contaminant_accessions[ 8 ]: count of unique contaminant accessions (EXCLUDE|FIX|TRIM|REVIEW_RARE)
   total_contaminant_length[ 9 ]: total contaminant length (bp)
   fraction_contaminated[ 10 ]: fraction of total genome length assigned as contamination (0-1)
   exclude_ct[ 11 ]: EXCLUDE count
   fix_ct[ 12 ]: FIX span count
   trim_ct[ 13 ]: TRIM span count
   rare_ct[ 14 ]: REVIEW_RARE span count
   review_ct[ 15 ]: REVIEW span count
   info_ct[ 16 ]: INFO span count
   adaptor_vector_ct[ 17 ]: adaptor/vector span count
   organelle_ct[ 18 ]: organelle span count
   anml[ 19 ]: animal contamination (bp)
   anml_minus_primates[ 20 ]: non-primate animal contamination (bp)
   fung[ 21 ]: fungi contamination (bp)
   plnt[ 22 ]: plant contamination (bp)
   prst[ 23 ]: protist contamination (bp)
   prok[ 24 ]: bacteria contamination (bp)
   arch[ 25 ]: archaea contamination (bp)
   virs[ 26 ]: virus contamination (bp)
   synt[ 27 ]: synthetic contamination (bp)
   adptor_vector[ 28 ]: adaptor/vector contamination (bp)

2. fcs_details*.txt.gz - details of found contaminants

   columns are:
   seq_id[ 1 ]: sequence accession.version
   start_pos[ 2 ]: contaminant start (1 based)
   end_pos[ 3 ]: contaminant end (1 based)
   seq_len[ 4 ]: total sequence length
   action[ 5 ]: EXCLUDE|FIX|TRIM|REVIEW|REVIEW_RARE|INFO|MITOCHONDRION|PLASTID
   contam_type[ 6 ]: contaminant type
   coverage[ 7 ]: GX contaminant coverage (1-100%). Sequences with only
                  adaptor/vector contamination have coverage value set to 0.
   contam_details[ 8 ]: GX top sequence hit (n/a = low signal)
   assembly_accession[ 9 ]: assembly accession.version

===================================================================
COLUMN SPECIFICATIONS FOR INDIVIDUAL ASSEMBLY CONTAMINATION REPORTS
===================================================================

A contamination report file with the following columns is provided in each
assembly directory on FTP:

   seq_id[ 1 ]: sequence accession.version
   start_pos[ 2 ]: contaminant start (1 based)
   end_pos[ 3 ]: contaminant end (1 based)
   seq_len[ 4 ]: total sequence length
   action[ 5 ]: EXCLUDE|FIX|TRIM|REVIEW|REVIEW_RARE|INFO
   contam_type[ 6 ]: contaminant type
   coverage[ 7 ]: GX contaminant coverage (1-100%)
   contam_details[ 8 ]: GX top sequence hit (n/a = low signal)

====================================
HOW TO USE THE CONTAMINATION REPORTS
====================================

Retrieve a set of genomes for a taxonomic group of interest, filtering out
those with high levels of contamination.

For example, run the NCBI datasets and dataformat commands to retrieve genome
metadata for all latest GenBank bird genomes:

   datasets summary genome taxon 'birds' --assembly-source genbank \
     --assembly-version 'latest' --as-json-lines \
     | dataformat tsv genome --fields accession > birds.asm.list

Filter genomes to keep those with less than 5% contamination:

   zcat fcs_summary_genbank.txt.gz \
     | fgrep -w -f birds.asm.list \
     | awk -v FS='\t' -v OFS='\t' '$10<0.05' > birds.accept.list

Filter genomes to keep those with less than 100 kbp contamination:

   zcat fcs_summary_genbank.txt.gz \
     | fgrep -w -f birds.asm.list \
     | awk -v FS='\t' -v OFS='\t' '$9<100000' > birds.accept.list

Using individual contamination reports

Run the NCBI datasets command to download a genome package:

   datasets download genome accession GCF_014824575.1 \
     --filename Sturnia_hondurensis_dataset.zip
   unzip Sturnia_hondurensis_dataset.zip

Option 1: Clean the genome with NCBI FCS.
This will remove sequences at EXCLUDE|TRIM regions and hardmask contaminant
ranges at FIX regions.
https://github.com/ncbi/fcs

   curl -LO https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/824/575/GCF_014824575.1_WHU_Shon_v2/GCF_014824575.1_WHU_Shon_v2_fcs_report.txt
   cat ncbi_dataset/data/GCF_014824575.1/GCF_014824575.1_WHU_Shon_v2_genomic.fna \
     | python3 ./fcs.py clean genome \
         --action-report GCF_014824575.1_WHU_Shon_v2_fcs_report.txt \
         --output clean.fasta --contam-fasta-out contam.fasta

Note that by default, fcs.py clean genome acts only on accessions marked
EXCLUDE|FIX|TRIM, while accessions marked
REVIEW|REVIEW_RARE|INFO|MITOCHONDRION|PLASTID are ignored. This default is a
conservative approach to contamination cleanup. If you wish to be more
aggressive, do these substitutions before cleaning the genome:

   # REVIEW spans that cover the whole sequence become EXCLUDE
   awk -v FS='\t' -v OFS='\t' \
     '{ if ($5 ~ /REVIEW/ && $2 == 1 && $3 == $4) gsub(/REVIEW.*/, "EXCLUDE", $5); print }' \
     fcs_report.txt

   # REVIEW spans that touch exactly one end of the sequence become TRIM
   awk -v FS='\t' -v OFS='\t' \
     '{ if ($5 ~ /REVIEW/ && ($2 == 1 && $3 != $4 || $2 != 1 && $3 == $4)) gsub(/REVIEW.*/, "TRIM", $5); print }' \
     fcs_report.txt

   # REVIEW spans internal to the sequence become FIX
   awk -v FS='\t' -v OFS='\t' \
     '{ if ($5 ~ /REVIEW/ && $2 != 1 && $3 != $4) gsub(/REVIEW.*/, "FIX", $5); print }' \
     fcs_report.txt

Option 2: Clean the genome with bedtools.
This will hardmask sequences at all contaminant ranges. Note that the
fcs_report.txt can include rows marked
REVIEW|REVIEW_RARE|INFO|MITOCHONDRION|PLASTID, which may or may not be of
interest to mask; you can grep out labels you wish to ignore.
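
As one way to drop unwanted labels before building a BED file, a minimal
sketch follows. It assumes the tab-separated report layout described above
(action in column 5, header lines starting with '#'). The function name
fcs_report_to_bed and the comma-separated skip list are illustrative
assumptions, not part of FCS or bedtools.

```shell
# Hypothetical helper (not part of FCS or bedtools): convert an FCS report
# to BED while skipping header lines and rows with unwanted action labels.
#   $1 = path to the tab-separated FCS report
#   $2 = comma-separated action labels to skip, e.g. 'INFO,PLASTID'
fcs_report_to_bed() {
    awk -v FS='\t' -v OFS='\t' -v skip="$2" '
        # Build a lookup table of the labels to ignore.
        BEGIN { n = split(skip, labels, ","); for (i = 1; i <= n; i++) drop[labels[i]] = 1 }
        # Skip header/comment lines.
        /^#/ { next }
        # Skip rows whose action (column 5) is in the ignore list.
        ($5 in drop) { next }
        # Report coordinates are 1 based; BED starts are 0 based.
        { print $1, $2 - 1, $3 }
    ' "$1"
}
```

For example, calling
fcs_report_to_bed GCF_014824575.1_WHU_Shon_v2_fcs_report.txt 'INFO,MITOCHONDRION,PLASTID' > GCF_014824575.1_fcs.bed
would write a BED file that omits those three labels and can then be passed
to bedtools maskfasta.
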
https://bedtools.readthedocs.io/en/latest/

Clean the genome using the individual fcs report file:

   grep -v '#' GCF_014824575.1_WHU_Shon_v2_fcs_report.txt \
     | awk -v FS='\t' -v OFS='\t' '{print $1,$2-1,$3}' > GCF_014824575.1_fcs.bed
   bedtools maskfasta \
     -fi ncbi_dataset/data/GCF_014824575.1/GCF_014824575.1_WHU_Shon_v2_genomic.fna \
     -bed GCF_014824575.1_fcs.bed -fo GCF_014824575.1_clean.fna

Clean the genome with bedtools using the aggregate fcs report file:

   zcat fcs_details_refseq.txt.gz \
     | grep -w GCF_014824575.1 \
     | awk -v FS='\t' -v OFS='\t' '{print $1,$2-1,$3}' > GCF_014824575.1_fcs.bed
   bedtools maskfasta \
     -fi ncbi_dataset/data/GCF_014824575.1/GCF_014824575.1_WHU_Shon_v2_genomic.fna \
     -bed GCF_014824575.1_fcs.bed -fo GCF_014824575.1_clean.fna

________________________________________________________________________________
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894, USA
tel: (301) 496-2475
fax: (301) 480-9241
e-mail: info@ncbi.nlm.nih.gov
________________________________________________________________________________