SynBad is a tool for comparing two related genome assemblies and identify putative translocations and inversions between the two that correspond to gap positions. These positions could indicate misplaced scaffolding.
SynBad will use or create:
A table of gap positions for each assembly (seqname, start, end). This can optionally have long reads mapped and spanning coverage calculated for each gap using Diploidocus. Gaps without spanning long reads are more likely to correspond to misassemblies.
The qryunique and hitunique local hits tables from a GABLAM run using Minimap2.
Pairwise hits between the genomes are filtered according to the
minlocid=PERC and minloclen=INT criteria,
which by default limits hits to be at least 1kb and 50% identity. Note
that this is applied after GABLAM has run, so it should be possible to
re-run with more relaxed constraints and re-use the GABLAM tables.
Next, all gap positions are read in along with the local hits tables.
For each genome, the local hit tables are sorted and QryGap
and SbjGap fields added. Any local alignments with flanking
hits are then flagged in these new fields with the number of flanking
5’, 3’ gaps.
The gap tables will also be updated with GapSpan and
SynSpan fields that have the distance between the
corresponding local hits on the Qry and Sbj genomes. If there is also an
inversion, SynSpan will be negative. If the local hits are
against two different sequences from the other genome, the two sequence
names will be entered in the SynSpan field instead. If the
gap is in the middle of local hit (likely to be true only for small
gaps), SynSpan or GapSpan will have a value of
zero.
Gaps will then be classified according to the associated
GapSpan and SynSpan values:
Aln = Aligned = Gap is found in the middle
of a local alignment to the HitSyn = Syntenic = Difference between
positive SynSpan and GapSpan is
maxsynspan=INT or less (default 25 kb).Ins = Insertion = Achieved
Syntenic rating by skipping upto
maxsynskip=INT local alignments and max
maxsynspan=INT bp in both Qry and Hit.Brk = Breakpoint = Difference between
positive SynSpan and GapSpan is bigger than
the maxsynspan=INT distance.Dup = Duplication = Overlapping flanking
hits on the same strand.Inv = Inversion = Flanking hits are on
alternative strands.Tran = Translocation =
SynSpan indicates matches are on different scaffolds.Frag = Fragmentation =
SynSpan indicates matches are on different scaffolds, 1+ of
which is not a chromosome scaffold.Term = Terminal = Gap is between a local
alignment and the end of the query sequence.Span = Spanned = Any gaps without Aligned
or Syntenic rating that are spanned by at least
synreadspan=INT reads.Null = No mapping between genomes for that gap.If chr1=X and/or chr2=X chromosome scaffold
prefixes are provided then Translocation will be restricted
to matches between two different chromosome scaffolds. Matches including
one or more non-chromosome scaffolds will be classed as
Fragmentation.
Following gap fixes for inversions and translocations/breakpoints (below), following
After the initial mapping and GABLAM unique hit reduction, SynBad
will further compress the Qry-Hit pairs by combining collinear hits
between Qry-Hit pairs into larger blocks. Merged hits must be within
maxsynspan=INT bp, overlap by no more than
maxoverlap=INT bp, and have no intervening assembly gaps.
An initial pass compares adjacent hits in position order. This is
followed by iterative compressions where up to
maxsynskip=INT intervening hits will be incorporated,
providing there are no assembly gaps between the merged hits. This will
start with the longest local hit, scanning downstream and then upstream
for hits to merge, with increasing numbers of intervening hits up to
maxsynskip=INT. Once no more merging is found for that hit,
the next longest hit will be considered. If any hits are merged during
this process, it will be repeated until convergence is reached.
During merging, Length and Identity
statistics will be summed, and additional statistics will be added:
Non = The unaligned bases in the query between the
GABLAM local hits.Alt = The number of bases (Length) of hits
to different hit scaffoldsTrans = The number of bases (Length) of
hits to the same scaffold but outside the merged hits.Inv = The number of bases (Length) of
intermediate hits that are in the right place but inverted strand.Dup = The number of overlapping bases in the query
scaffolds between adjacent hits.NOTE: This will be done for both genomes. For Genome 2, the “hits”
are the Qry scaffolds, and all assessments of collinearity
will be performed on the Hit scaffolds.
Next, SynBad will assess inversions and apparent translocations for flanking collinear hits and instigate a number of updated ratings or suggested fixes. The following new SynBad gap ratings may be added:
Div = Divergence = Translocation or
Breakpoint gap with flanking collinear hits.InvBrk = Inversion Breakpoint =
Reclassified Inversion that appears to be out of place.InvFix = Fixed Inversion = Inversion that
becomes collinear when inverted. (See
*.corrections.tdt)InvDupFix = Fixed Inversion Duplication =
Inversion that becomes a Duplication when inverted. (See
*.corrections.tdt)Long = Long Syntenic = Gap flanked by
collinear Qry/Hit pairs but distance is greater than
maxsynspan=INT (default 25 kb).If fragment=T, the assemblies will then be fragmented on
gaps that are not Syntenic, unless more than
minreadspan=INT reads span the gap.
A future release of Synbad will optionally re-arrange the two assemblies, incorporating gapass assemblies where possible. GapSpanner can also be used to gap-fill spanned gaps prior to running SynBad.
SynBad needs Minimap2 installed. For gapass gap mode,
Flye also needs to be installed. For KAT kmer assessments of flanks, KAT
must be installed (or pre-run on the genomes). To generate documentation
with dochtml, R will need to be installed and a pandoc
environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
For full documentation of the SynBad workflow, run with
dochtml=T and read the *.docs.html file
generated.
### ~ Main SynBad run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
genome1=FILE : Genome assembly used as the query in the GABLAM searches []
genome2=FILE : Genome assembly used as the searchdb in the GABLAM searches []
basefile=X : Prefix for output files [synbad]
gablam=X : Optional prefix for GABLAM search [defaults to $BASEFILE.map]
gapmode=X : Diploidocus gap run mode (gapspan/gapass) [gapspan]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [1000]
minlocid=PERC : Minimum percentage identity for aligned chunk to be kept (local %identity) [50]
maxsynskip=INT : Maximum number of local alignments to skip for SynTrans classification [4]
maxsynspan=INT : Maximum distance (bp) between syntenic local alignments to count as syntenic [25000]
synreadspan=INT : Minimum number of reads spanning a gap to change the rating to "Spanned" [5]
checkflanks=LIST: List of lengths flanking gaps that must also be spanned by reads [0,100,1000]
spannedflank=INT: Required flanking distance for synreadspan "Spanned" rating [0]
maxoverlap=INT : Maximum overlap (bp) of adjacent local hits to allow compression [500]
chr1=X : PAFScaff-style chromosome prefix for Genome 1 to distinguish Translocation from Fragmentation []
chr2=X : PAFScaff-style chromosome prefix for Genome 2 to distinguish Translocation from Fragmentation []
### ~ Correction and Fragmentation options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
correct=LIST : List of edit types to try to fix in the assembly (invert/extract/relocate/break/join; T/True=all) [True]
fragment=T/F : Whether to fragment the assembly at gaps marked as non-syntenic if no corrections made [False]
fragtypes=LIST : List of SynBad ratings to trigger fragmentation [Brk,Inv,InvBrk,Frag,Tran]
minreadspan=INT : Min number of Span0 reads in gaps table to prevent fragmentation [1]
minctglen=INT : Extract any contigs below a minimum length threshold [500]
minbadctg=INT : Extract any contigs with bad flanking ratings below a minimum length threshold [5000]
minscafflen=INT : Remove any scaffolds (inc. detached/extracted contigs) below minimum length threshold [500]
gapsize=INT : Size of gaps to add when relocating assembling chunks [500]
rejoin=T/F : Whether to rejoin original gaps that end up split into termini but not too short [True]
### ~ Additional input options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
masked1=FASFILE : Optional masked fasta file for assembly comparison [$BASEFILE1.masked.fasta]
bam1=FILE : Optional BAM file of long reads mapped onto assembly 1 [$BASEFILE1.bam]
paf1=FILE : Optional PAF file of long reads mapped onto assembly 1 [$BASEFILE1.paf]
reads1=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype1=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco1=FILE : Optional BUSCO full results file for genome 1 []
scdep1=NUM : Optional single copy read depth for genome 1 []
masked2=FASFILE : Optional masked fasta file for assembly comparison [$BASEFILE2.masked.fasta]
bam2=FILE : Optional BAM file of long reads mapped onto assembly 2 [$BASEFILE2.bam]
paf2=FILE : Optional PAF file of long reads mapped onto assembly 2 [$BASEFILE2.paf]
reads2=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype2=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco2=FILE : Optional BUSCO full results file for genome 2 []
scdep2=NUM : Optional single copy read depth for genome 2 []
mapflanks1=FILE : Flanks fasta file from previous SynBad run for mapping genome 1 flank identifiers []
mapflanks2=FILE : Flanks fasta file from previous SynBad run for mapping genome 2 flank identifiers []
fullmap=T/F : Whether to abort if not all flanks can be mapped [True]
### ~ HiC Gap Flank options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
hicbam1=FILE : Optional BAM file of HiC reads mapped onto assembly 1 [$BASEFILE1.HiC.bam]
hicbam2=FILE : Optional BAM file of HiC reads mapped onto assembly 2 [$BASEFILE2.HiC.bam]
gapflanks=INT : Size of gap flank regions to output for HiC pairing analysis (0=off) [10000]
pureflanks=T/F : Whether to restrict gap flanks to pure contig sequence (True) or include good gaps (False) [True]
hicscore=X : HiC scoring mode (pairs/score/wtscore) [wtscore]
hicmode=X : Pairwise HiC assessment scoring strategy (synbad/pure/rand/full) [synbad]
hicmin=X : Min. number of HiC read pairs for a "best" HiC pairing ruling [3]
hicdir1=PATH : Path to HiC read ID lists for genome 1 [$BASEFILE.qryflanks/]
hicdir2=PATH : Path to HiC read ID lists for genome 1 [$BASEFILE.hitflanks/]
### ~ Additional output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
newacc1=X : Scaffold name prefix for updated Genome 1 output [None]
newacc2=X : Scaffold name prefix for updated Genome 2 output [None]
bestpair=T/F : Whether to restrict the paired output to the top scaffold pairs [False]
update=T/F : Whether to reload compressed qry and hit tables but re-run additional compression [False]
force=T/F : Whether to force regeneration of SynBad results tables [False]
dochtml=T/F : Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
Details of the SynBad workflow will be added in a future release.
NOTE: SynBad is still under development. Features and details will continue to be updated.
© 2020 Richard Edwards | [email protected]