################################################################################
README for the fcs*.txt.gz files found on the NCBI genomes 
FTP site: https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/
  fcs_summary_refseq.txt.gz
  fcs_summary_genbank.txt.gz
  fcs_details_refseq.txt.gz
  fcs_details_genbank.txt.gz

Last updated:  August 19, 2024
################################################################################

BACKGROUND:
NCBI's Foreign Contamination Screen (FCS) is available from GitHub:
https://github.com/ncbi/fcs

The aggregate reports in this directory summarize results from the NCBI contamination screens
run on public GenBank and RefSeq assemblies.

In addition, each assembly directory on FTP has an individual contamination report.

Aggregate files are broken down by:
1. genbank - GenBank (GCA_ prefix) assemblies
2. refseq - RefSeq (GCF_ prefix) assemblies

=========================================================
COLUMN SPECIFICATIONS FOR AGGREGATE CONTAMINATION REPORTS
=========================================================

1. fcs_summary*.txt.gz - summary of assembly details and found contamination
columns are:
            assembly_accession[  1 ]:  assembly accession.version
           fcs_gx_software_rev[  2 ]:  FCS-GX software version producing the contamination report
                 fcs_gx_db_rev[  3 ]:  FCS-GX database version producing the contamination report
               fcs_gx_run_date[  4 ]:  FCS-GX run date
             asserted_division[  5 ]:  FCS-GX declared taxonomic division
         top_inferred_division[  6 ]:  FCS-GX top hit taxonomic division
     fcs_gx_aggregate_coverage[  7 ]:  FCS-GX aggregate genome coverage by database sequences
  total_contaminant_accessions[  8 ]:  count of unique contaminant accessions (EXCLUDE|FIX|TRIM|REVIEW_RARE)
      total_contaminant_length[  9 ]:  total contaminant length (bp)
         fraction_contaminated[ 10 ]:  fraction of total genome length assigned as contamination (0-1)
                    exclude_ct[ 11 ]:  EXCLUDE count
                        fix_ct[ 12 ]:  FIX span count
                       trim_ct[ 13 ]:  TRIM span count
                       rare_ct[ 14 ]:  REVIEW_RARE span count
                     review_ct[ 15 ]:  REVIEW span count
                       info_ct[ 16 ]:  INFO span count
             adaptor_vector_ct[ 17 ]:  adaptor/vector span count
                  organelle_ct[ 18 ]:  organelle span count
                          anml[ 19 ]:  animal contamination (bp)
           anml_minus_primates[ 20 ]:  non-primate animal contamination (bp)
                          fung[ 21 ]:  fungi contamination (bp)
                          plnt[ 22 ]:  plant contamination (bp)
                          prst[ 23 ]:  protist contamination (bp)
                          prok[ 24 ]:  bacteria contamination (bp)
                          arch[ 25 ]:  archaea contamination (bp)
                          virs[ 26 ]:  virus contamination (bp)
                          synt[ 27 ]:  synthetic contamination (bp)
                 adptor_vector[ 28 ]:  adaptor/vector contamination (bp)
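
As a hedged illustration of how the summary columns can be used, the sketch below lists assemblies whose asserted_division (column 5) differs from top_inferred_division (column 6); such rows may be worth a closer look. The function name fcs_division_mismatch is hypothetical and not part of FCS. The sketch assumes tab-separated input and that any header line starts with '#'.

```shell
# Hypothetical helper (not part of FCS): print accession, asserted division and
# top inferred division for rows where columns 5 and 6 differ.
# Assumes tab-separated input; lines starting with '#' are treated as headers.
fcs_division_mismatch() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                    # skip header/comment lines
    $5 != $6 { print $1, $5, $6 }    # asserted vs. top inferred division
  '
}

# Usage (reads the summary file on stdin):
#   zcat fcs_summary_refseq.txt.gz | fcs_division_mismatch
```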

2. fcs_details*.txt.gz - details of found contaminants
columns are:
                        seq_id[  1 ]:  sequence accession.version
                     start_pos[  2 ]:  contaminant start (1 based)
                       end_pos[  3 ]:  contaminant end (1 based)
                       seq_len[  4 ]:  total sequence length
                        action[  5 ]:  EXCLUDE|FIX|TRIM|REVIEW|REVIEW_RARE|INFO|MITOCHONDRION|PLASTID
                   contam_type[  6 ]:  contaminant type
                      coverage[  7 ]:  GX contaminant coverage (1-100%). Sequences with only adaptor/vector contamination have coverage value set to 0. 
                contam_details[  8 ]:  GX top sequence hit (n/a = low signal)
            assembly_accession[  9 ]:  assembly accession.version
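
Because start_pos and end_pos are 1-based, a span's length can be taken as end_pos - start_pos + 1. This assumes end_pos is inclusive, which is consistent with the BED conversion used later in this README. The sketch below totals span lengths per action value. The function name fcs_bp_by_action is hypothetical, lines starting with '#' are assumed to be headers, and overlapping spans (if any) would be counted twice.

```shell
# Hypothetical helper (not part of FCS): sum contaminant span lengths per action.
# Input: tab-separated details rows on stdin (seq_id, start_pos, end_pos, seq_len, action, ...).
fcs_bp_by_action() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                        # skip header/comment lines
    { bp[$5] += $3 - $2 + 1 }            # 1-based inclusive span length
    END { for (a in bp) print a, bp[a] } # one line per action
  ' | sort                               # awk array order is unspecified, so sort
}

# Usage:
#   zcat fcs_details_refseq.txt.gz | grep -w GCF_014824575.1 | fcs_bp_by_action
```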

===================================================================
COLUMN SPECIFICATIONS FOR INDIVIDUAL ASSEMBLY CONTAMINATION REPORTS
===================================================================

A contamination report file with the following columns is provided in each assembly directory on FTP:
                       seq_id[  1 ]:  sequence accession.version
                    start_pos[  2 ]:  contaminant start (1 based)
                      end_pos[  3 ]:  contaminant end (1 based)
                      seq_len[  4 ]:  total sequence length
                       action[  5 ]:  EXCLUDE|FIX|TRIM|REVIEW|REVIEW_RARE|INFO
                  contam_type[  6 ]:  contaminant type
                     coverage[  7 ]:  GX contaminant coverage (1-100%)
               contam_details[  8 ]:  GX top sequence hit (n/a = low signal)
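
The location of an individual report can be derived from the assembly accession and assembly name. The sketch below builds the URL following the pattern of the example URL used later in this README (genomes/all/<prefix>/<3 digits>/<3 digits>/<3 digits>/<accession>_<name>/). The function name fcs_report_url is hypothetical, and the assembly name must be spelled as it appears in the FTP directory name.

```shell
# Hypothetical helper (not part of FCS): build the URL of an individual FCS report.
# $1 = assembly accession.version (e.g. GCF_014824575.1)
# $2 = assembly name as it appears in the FTP directory name (e.g. WHU_Shon_v2)
fcs_report_url() (
  acc=$1
  name=$2
  prefix=${acc%%_*}                     # GCF or GCA
  digits=${acc#*_}                      # 014824575.1
  digits=${digits%%.*}                  # 014824575
  d1=$(printf '%s' "$digits" | cut -c1-3)
  d2=$(printf '%s' "$digits" | cut -c4-6)
  d3=$(printf '%s' "$digits" | cut -c7-9)
  printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s/%s_%s_fcs_report.txt\n' \
    "$prefix" "$d1" "$d2" "$d3" "$acc" "$name" "$acc" "$name"
)

# Usage:
#   curl -LO "$(fcs_report_url GCF_014824575.1 WHU_Shon_v2)"
```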

====================================
HOW TO USE THE CONTAMINATION REPORTS
====================================

Retrieve a set of genomes for a taxonomic group of interest, filtering out those with high levels of contamination.

For example, run the NCBI datasets and dataformat commands to retrieve genome metadata for all latest GenBank bird genomes:

datasets summary genome taxon 'birds' --assembly-source genbank --assembly-version 'latest' --as-json-lines | dataformat tsv genome --fields accession > birds.asm.list

Filter genomes to keep those with less than 5% contamination:
zcat fcs_summary_genbank.txt.gz | fgrep -w -f birds.asm.list | awk -v FS='\t' -v OFS='\t' '$10<0.05' > birds.accept.list

Filter genomes to keep those with less than 100 kbp contamination:
zcat fcs_summary_genbank.txt.gz | fgrep -w -f birds.asm.list | awk -v FS='\t' -v OFS='\t' '$9<100000' > birds.accept.list
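
The two thresholds can also be applied together. The sketch below is one possible way to do this: it matches accessions against column 1 only (rather than anywhere on the line) and prints just the accession of each accepted assembly. The function name fcs_accept and its argument order are hypothetical.

```shell
# Hypothetical helper (not part of FCS): keep assemblies below both thresholds.
# $1 = file with one assembly accession per line
# $2 = maximum fraction_contaminated (column 10), exclusive
# $3 = maximum total_contaminant_length in bp (column 9), exclusive
# Summary rows are read from stdin; accepted accessions are printed.
fcs_accept() (
  list=$1
  max_frac=$2
  max_bp=$3
  awk -F'\t' -v max_frac="$max_frac" -v max_bp="$max_bp" '
    NR == FNR { keep[$1]; next }          # first input: the accession list
    ($1 in keep) && ($10 + 0) < (max_frac + 0) && ($9 + 0) < (max_bp + 0) { print $1 }
  ' "$list" -
)

# Usage:
#   zcat fcs_summary_genbank.txt.gz | fcs_accept birds.asm.list 0.05 100000 > birds.accept.acc
```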

Using individual contamination reports

Run the NCBI datasets command to download a genome package:

datasets download genome accession GCF_014824575.1 --filename Sturnia_hondurensis_dataset.zip
unzip Sturnia_hondurensis_dataset.zip

Option 1: Clean the genome with NCBI FCS. This removes sequences at EXCLUDE|TRIM regions and hardmasks contaminant ranges at FIX regions.

https://github.com/ncbi/fcs

curl -LO https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/824/575/GCF_014824575.1_WHU_Shon_v2/GCF_014824575.1_WHU_Shon_v2_fcs_report.txt
cat ncbi_dataset/data/GCF_014824575.1/GCF_014824575.1_WHU_Shon_v2_genomic.fna | python3 ./fcs.py clean genome --action-report GCF_014824575.1_WHU_Shon_v2_fcs_report.txt --output clean.fasta --contam-fasta-out contam.fasta

Note that by default, fcs.py clean genome acts only on rows marked EXCLUDE|FIX|TRIM, while rows marked REVIEW|REVIEW_RARE|INFO|MITOCHONDRION|PLASTID
are ignored. This default is a conservative approach to contamination cleanup. To be more aggressive, apply these substitutions to the report before cleaning the genome:

awk -v FS='\t' -v OFS='\t' '{ if ($5 ~ /REVIEW/ && $2 == 1 && $3 == $4) gsub(/REVIEW.*/, "EXCLUDE", $5); print }' fcs_report.txt
awk -v FS='\t' -v OFS='\t' '{ if ($5 ~ /REVIEW/ && ($2 == 1 && $3 != $4 || $2 != 1 && $3 == $4)) gsub(/REVIEW.*/, "TRIM", $5); print }' fcs_report.txt
awk -v FS='\t' -v OFS='\t' '{ if ($5 ~ /REVIEW/ && $2 != 1 && $3 != $4) gsub(/REVIEW.*/, "FIX", $5); print }' fcs_report.txt
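
Each of the three commands above prints its own modified copy of the report. As a sketch of how the same three rules could be applied in a single pass, the function below relabels REVIEW and REVIEW_RARE rows by position: whole-sequence spans become EXCLUDE, spans touching one end become TRIM, and internal spans become FIX. The function name fcs_relabel_review and the output file name in the usage line are hypothetical.

```shell
# Hypothetical helper (not part of FCS): relabel REVIEW/REVIEW_RARE rows in one pass.
# Reads a tab-separated FCS report on stdin and prints the modified report.
fcs_relabel_review() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { print; next }                   # pass header/comment lines through unchanged
    $5 ~ /^REVIEW/ {
      at_start = ($2 == 1)                 # span begins at the first base
      at_end   = ($3 == $4)                # span ends at the last base
      if (at_start && at_end)      $5 = "EXCLUDE"   # whole sequence
      else if (at_start || at_end) $5 = "TRIM"      # touches one end
      else                         $5 = "FIX"       # internal span
    }
    { print }
  '
}

# Usage:
#   fcs_relabel_review < fcs_report.txt > fcs_report.aggressive.txt
```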

Option 2: Clean the genome with bedtools. This will hardmask sequences at all contaminant ranges. Note that the fcs_report.txt can include rows marked
REVIEW|REVIEW_RARE|INFO|MITOCHONDRION|PLASTID, which may or may not be of interest to mask; you can grep out the labels you wish to ignore.

https://bedtools.readthedocs.io/en/latest/

Clean the genome with bedtools using the individual fcs report file:

grep -v '^#' GCF_014824575.1_WHU_Shon_v2_fcs_report.txt | awk -v FS='\t' -v OFS='\t' '{print $1,$2-1,$3}' > GCF_014824575.1_fcs.bed
bedtools maskfasta -fi ncbi_dataset/data/GCF_014824575.1/GCF_014824575.1_WHU_Shon_v2_genomic.fna -bed GCF_014824575.1_fcs.bed -fo GCF_014824575.1_clean.fna

Clean the genome with bedtools using the aggregate fcs report file:

zcat fcs_details_refseq.txt.gz | grep -w GCF_014824575.1 | awk -v FS='\t' -v OFS='\t' '{print $1,$2-1,$3}' > GCF_014824575.1_fcs.bed
bedtools maskfasta -fi ncbi_dataset/data/GCF_014824575.1/GCF_014824575.1_WHU_Shon_v2_genomic.fna -bed GCF_014824575.1_fcs.bed -fo GCF_014824575.1_clean.fna
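
The awk step in the commands above converts the 1-based report coordinates to BED, which uses a 0-based start and an exclusive end, hence $2-1 for the start and $3 unchanged for the end. As a sketch of how labels can be dropped before masking, the function below does the same conversion but skips rows whose action matches a regular expression. The function name fcs_to_bed is hypothetical.

```shell
# Hypothetical helper (not part of FCS): convert FCS report rows to BED,
# optionally skipping rows whose action (column 5) matches a regular expression.
# $1 = extended regular expression of actions to skip (optional)
fcs_to_bed() (
  skip=${1:-'^$'}                        # default pattern matches only an empty action
  awk -F'\t' -v OFS='\t' -v skip="$skip" '
    /^#/ { next }                        # skip header/comment lines
    $5 ~ skip { next }                   # skip unwanted action labels
    { print $1, $2 - 1, $3 }             # 1-based inclusive -> 0-based half-open
  '
)

# Usage: mask everything except REVIEW, REVIEW_RARE and INFO rows
#   fcs_to_bed '^(REVIEW|REVIEW_RARE|INFO)$' < GCF_014824575.1_WHU_Shon_v2_fcs_report.txt > GCF_014824575.1_fcs.bed
```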

________________________________________________________________________________
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894, USA
tel: (301) 496-2475
fax: (301) 480-9241
e-mail: info@ncbi.nlm.nih.gov
________________________________________________________________________________