1 DepthKopy: Single-copy read-depth and kmer based copy number analysis

DepthKopy is an updated version of the regcnv methods of Diploidocus, using the same single-copy read depth estimation from BUSCO Complete genes as DepthSizer.

DepthKopy needs a genome assembly (fasta format, seqin=FILE), a set of long read (ONT, PacBio or HiFi) data for the assembly (reads=FILELIST and readtype=LIST) or mapped BAM file (bam=FILE), and a BUSCO/BUSCOMP full table of results (busco=TSVFILE) or pre-calculated single-copy read depth (scdepth=NUM). For kmer analysis, reads also need to be provided with kmerreads=FILELIST (and 10xtrim=T if barcoded 10x linked reads).

Optionally, it can then take one or more additional sources of assembly regions for which stats will be calculated. Delimited text files or GFF files can be provided using regfile=FILE for a single file, or regfile=CDICT for multiple files. Multiple files are given in the form Name1:File1,Name2:File2,...,NameN:FileN, from which each unique Name will be plotted as a separate violin (see output). GFF files will be parsed to extract any features matching the types given by gfftype=LIST (default = gene). Delimited files will need to have headers that match those provided by checkfields=LIST (default = SeqName,Start,End). By default, 100 kb non-overlapping windows will also be output (winsize=INT winstep=NUM). Unless seqstats=F, statistics will also be calculated per assembly scaffold. Subsets of scaffolds can be given separate window output plots using chromcheck=LIST (scaffold names) or chromcheck=INT (min size).

DepthKopy works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth.

DepthKopy uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth. This is converted into an estimated single copy read depth using a smoothed density plot of BUSCO single copy genes. BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table matching the BUSCO fields: [‘BuscoID’,‘Status’,‘Contig’,‘Start’,‘End’,‘Score’,‘Length’]. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy. Output from BUSCOMP is also compatible with DepthKopy. DepthKopy has been tested with outputs from BUSCO v3 and v5.

Output is a table of depth statistics and predicted copy number for each input dataset (BUSCO Complete, BUSCO Duplicated, regfile Regions, assembly scaffolds (seqstat=T) and sliding windows). Data visualisations are also output for each region set using ggstatsplot.

1.1 Citation

DepthKopy has been published as part of the Waratah genome paper:

Chen SH, Rossetto M, van der Merwe M, Lu-Irving P, Yap JS, Sauquet H, Bourke G, Amos TG, Bragg JG & Edwards RJ (2022). Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. Molecular Ecology Resources doi: 10.1111/1755-0998.13574

Please contact the author if you have trouble getting the full text version, or read the bioRxiv preprint version:

Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. bioRxiv 2021.06.02.444084; doi: 10.1101/2021.06.02.444084.

2 Running DepthKopy

DepthKopy is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/depthkopy.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone DepthKopy git repo, $CODEPATH will be the path the to code/ directory.

2.1 Dependencies

Unless bam=FILE is given, minimap2 must be installed and either added to the environment $PATH or given to DepthSizer with the minimap2=PROG setting, and samtools needs to be installed. R and tidyverse will also need be installed, ideally with the ggstatsplot and writexl libraries also installed. To generate documentation with dochtml, a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

For DepthKopy documentation, run with dochtml=T and read the *.docs.html file generated.

2.2 Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Main DepthKopy run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Input sequence assembly. (Must be *.fa or *.fasta for kmerself=T.) [None]
basefile=FILE   : Root of output file names [diploidocus or $SEQIN basefile]
scdepth=NUM     : Single copy ("diploid") read depth. If zero, will use SC BUSCO mode [0]
bam=FILE        : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
bamcsi=T/F      : Use CSI indexing for BAM files, not BAI (needed for v long scaffolds) [False]
reads=FILELIST  : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST   : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F     : Generate HTML DepthKopy documentation (*.docs.html) instead of main run [False]
### ~ Depth and Copy Number options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
busco=TSVFILE   : BUSCO full table [full_table_$BASEFILE.busco.tsv]
quickdepth=T/F  : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depchunk=INT    : Chunk input into minimum of INT bp chunks for temp depth calculation [1e6]
deponly=T/F     : Cease execution following checking/creating BAM and fastdep/fastmp files [False]
depfile=FILE    : Precomputed depth file (*.fastdep or *.fastmp) to use [None]
homfile=FILE    : Precomputed homology depth file (*.fasthom) to use [None]
regfile=CDICT   : List of Name:Files (or single FILE) of SeqName, Start, End positions (or GFF) for CN checking [None]
checkfields=LIST: Fields in checkpos file to give Locus, Start and End for checking [SeqName,Start,End]
gfftype=LIST    : Optional feature types to use if performing regcheck on GFF file (e.g. gene) ['gene']
winsize=INT     : Generate additional window-based depth and CN profiles of INT bp (0 to switch off) [100000]
winstep=NUM     : Generate window every NUM bp (or fraction of winsize=INT if <=1) [1]
chromcheck=LIST : Output separate window analysis violin plots for listed sequences (or min size) + 'Other' []
seqstats=T/F    : Whether to output CN and depth data for full sequences as well as BUSCO genes [True]
### ~ Rscript options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
outdir=PATH     : Redirect the outputs of the depthcopy.R script into outdir [./]
pointsize=INT   : Rescale the font size for the DepthKopy plots [24]
### ~ KAT kmer options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
kmerself=T/F        : Whether to perform additional assembly kmer analysis [True]
kmeralt=FILE        : Fasta file of alternative assembly for KAT kmer analysis [None]
kmerreads=FILELIST  : File of high quality reads for KAT kmer analysis []
10xtrim=T/F         : Whether to trim 16bp 10x barcodes from Read 1 of Kmer Reads data for KAT analysis [False]
### ~ System options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X         : Number of parallel sequences to process at once [0]
killforks=X     : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X     : Sleep time (seconds) between cycles of forking out more process [0]
tmpdir=PATH     : Path for temporary output files during forking [./tmpdir/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

3 DepthKopy workflow and options

The main inputs for DepthKopy copy number prediction are: * seqin=FILE : Input sequence assembly [Required]. * reads=FILELIST: List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped. A BAM file can be supplied instead with bam=FILE. * readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont] * busco=TSVFILE : BUSCO(MP) full table [full_table_$BASEFILE.busco.tsv] used for calculating single copy (“diploid”) read depth.

3.1 Step 1: BAM file (read mapping)

The first step is to generate a BAM file by mapping reads on to seqin using minimap2. A pre-generated BAM file can be given instead using bam=FILE. There should be no secondary mapping of reads, as these will inflate read depths, so filter these out if they were allowed during mapping. Similarly, the BAM file should not contain unmapped reads. (These should be filtered during processing if present.) If no BAM file setting is given, the BAM file will be named $BASEFILE.bam, where $BASEFILE is set by basefile=X.

3.2 Step 2: BUSCO(MP) results

DepthKopy works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some poor-quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth. This is estimated using a smoothed density distribution calculated using R density(). BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields: [‘BuscoID’,‘Status’,‘Contig’,‘Start’,‘End’,‘Score’,‘Length’]. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy. BUSCOMP v0.10.0 and above will generate a *.complete.tsv file that can be used in place of BUSCO results. This can enable rapid re-annotation of BUSCO genes following, for example, vector trimming with Diploidocus.

3.3 Step 3: Single-copy read depth

DepthKopy uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth and extracts the smoothed modal read depth for all single-copy (Complete BUSCO genes) using the density() function of R. To avoid a minority of extremely deep-coverage bases disrupting the density profile, the depth range is first limited to the range from zero to 1000, or four time the pure modal read depth if over 1000. If the pure mode is zero coverage, zero is returned. The number of bins for the density function is set to be greater than 5 times the max depth for the calculation.

By default, the density bandwidth smoothing parameter is set to adjust=12. This can be modified with depadjust=INT. The raw and smoothed profiles are output to *.plots/*.raw.png *.plots/*.scdepth.png to check smoothing if required. Additional checking plots are also output (see Outputs below).

The full output of depths per position is output to $BAM.fastmp (or $BAM.fastdep if quickdepth=T). The single-copy is also output to $BAM.fastmp.scdepth. By default, generation of the fastdep/fastmp data is performed by chunking up the assembly and creating temporary files in parallel (tmpdir=PATH). Sequences are batched in order such that each batch meets the minimum size criterion set by depchunk=INT (default 1Mbp). If depchunk=0 then each sequence will be processed individually. This is not recommended for large, highly fragmented genomes. Unless dev=T or debug=T, the temporary files will be deleted once the final file is made. If DepthSizer crashed during the generation of the file, it should be possible to re-run and it will re-use existing temporary files.

3.4 Step 5: Copy Number estimation

For each region analysed, the same density profile calculation is used to predict the dominant read depth across the region, which is then converted into copy number (CN) by dividing by the single-copy read depth. Confidence intervals are calculated based on random sampling of the observed single copy read depth. (Details to follow: available on request.) By default, the R script will parallelise this using the number of threads set with forks=INT. If this causes memory issues, it can be forced to run with a single thread using memsaver=T.

3.4.1 Region collapse for depth-adjusted copy number

Regions provided for DepthKopy summaries using the regfile=LIST can be collapsed to provide overall summary statistics using the collapse=LIST argument. If a delimited region file has been provided, any fields in collapse=LIST (Family by default) will be used to group and collapse regions. DepthKopy will output the number (N), summed length (BP), predicted copy number (CN) and CN-adjusted summed length (AdjBP) for each unique value of the collapse field. Copy number (XN) and summed lengths (XBP) adjusted by Mean depth (i.e. MeanX / SCDepth) are also output. This collapsing is done at three levels: (1) per sequence, (2) totals for the whole assembly, and (3) combined totals over all values of the collapse field. Note that no adjustment for overlapping features is made for the latter calculation. RepeatMasker GFF files will extract the repeat motif name into Family. Barrnap rRNA prediction GFF files will extract the rRNA gene product into Family. This enables a depth-adjusted estimate of rRNA and other repeat copy numbers.

3.5 Step 6: Outputs

Finally, CN and depth statistics are integrated with kmer and within-assembly homology data, and output as a series of plots and tables (below).

4 Outputs

The main DepthKopy outputs are:

*.regcnv.tsv = CN and depth statistics for different input region datasets.
*.xlsx = compiled Excel file of each region file.
*.log = DepthKopy log file with key steps and details of any errors or warnings generated.
*.plots/ = Directory of PNG plots (see below)

4.1 Region CN output tables

The BUSCO input file will get *.regcnv.tsv and *.dupcnv.tsv copy number prediction tables generated. Other region datasets will have a *.regcnv.tsv file. These will have the following fields added:

MeanX = Mean sequencing depth across region.
MedX = Median sequencing depth across region.
ModeX = Pure modal sequencing depth across region.
DensX = Density-plot adjusted modal sequencing depth across region.
SelfK = Density-plot adjusted modal kmer frequency across region. (In development.)
HomPC = Percentage of region with within-assembly homology identified. (In development.)

4.2 DepthKopy plots

DepthKopy will also generate a number of plots of results, which are output in $BASE.plots/ as PDF and PNG files.

First, the raw and smoothed read depth profiles will be output to: * $BASE.plots/$BASE.raw.png = raw depth profile * $BASE.plots/$BASE.scdepth.png = smoothed depth profile with SC depth marked

In addition, violin plots will be generated for the following $STAT values:

$BASE.plots/$BASE.CN.png = Estimated copy number.
$BASE.plots/$BASE.MeanX.png = Mean depth of coverage
$BASE.plots/$BASE.MedX.png = Median depth of coverage
$BASE.plots/$BASE.ModeX.png = Pure Modal depth of coverage
$BASE.plots/$BASE.DensX.png = Smoothed density modal depth of coverage

If seqstats=T then each assembly sequence will also be output in a Sequences violin plot for comparison. Each point in the plots is a separate gene or sequence.

Values for individual genes/sequences are also output as a density scatter plot named $BASE.plots/$BASE.$REGIONS.$STAT.png, where

$REGIONS is the type of region plotted (BUSCO complete, Duplicated, or assembly Sequences).
$STAT is the output statistic: MeanX, MedX, ModeX or DensX.

4.2.1 Re-running failed plots

If DepthKopy fails to run to completion, or you wish to change axes limits and/or edit out some of the features, you can also generate plots from the *.xlsx output file using the depthcopyplot.R file:

Rscript $CODEPATH/depthcopyplot.R basefile=$BASE [scdepth=NUM] [xlsx=FILE] [xsheets=LIST] [reghead=LIST] [pngdir=PATH] [cnmax=INT] [sigdif=T/F] [rdir=PATH]

DepthKopy V1.6.0 Documentation

Richard J. Edwards

2024-09-19