1 SynBad: Synteny-based scaffolding adjustment

SynBad is a tool for comparing two related genome assemblies and identify putative translocations and inversions between the two that correspond to gap positions. These positions could indicate misplaced scaffolding.

SynBad will use or create:

A table of gap positions for each assembly (seqname, start, end). This can optionally have long reads mapped and spanning coverage calculated for each gap using Diploidocus. Gaps without spanning long reads are more likely to correspond to misassemblies.
The qryunique and hitunique local hits tables from a GABLAM run using Minimap2.

Pairwise hits between the genomes are filtered according to the minlocid=PERC and minloclen=INT criteria, which by default limits hits to be at least 1kb and 50% identity. Note that this is applied after GABLAM has run, so it should be possible to re-run with more relaxed constraints and re-use the GABLAM tables.

Next, all gap positions are read in along with the local hits tables. For each genome, the local hit tables are sorted and QryGap and SbjGap fields added. Any local alignments with flanking hits are then flagged in these new fields with the number of flanking 5’, 3’ gaps.

The gap tables will also be updated with GapSpan and SynSpan fields that have the distance between the corresponding local hits on the Qry and Sbj genomes. If there is also an inversion, SynSpan will be negative. If the local hits are against two different sequences from the other genome, the two sequence names will be entered in the SynSpan field instead. If the gap is in the middle of local hit (likely to be true only for small gaps), SynSpan or GapSpan will have a value of zero.

Gaps will then be classified according to the associated GapSpan and SynSpan values:

Aln = Aligned = Gap is found in the middle of a local alignment to the Hit
Syn = Syntenic = Difference between positive SynSpan and GapSpan is maxsynspan=INT or less (default 25 kb).
Ins = Insertion = Achieved Syntenic rating by skipping upto maxsynskip=INT local alignments and max maxsynspan=INT bp in both Qry and Hit.
Brk = Breakpoint = Difference between positive SynSpan and GapSpan is bigger than the maxsynspan=INT distance.
Dup = Duplication = Overlapping flanking hits on the same strand.
Inv = Inversion = Flanking hits are on alternative strands.
Tran = Translocation = SynSpan indicates matches are on different scaffolds.
Frag = Fragmentation = SynSpan indicates matches are on different scaffolds, 1+ of which is not a chromosome scaffold.
Term = Terminal = Gap is between a local alignment and the end of the query sequence.
Span = Spanned = Any gaps without Aligned or Syntenic rating that are spanned by at least synreadspan=INT reads.
Null = No mapping between genomes for that gap.

If chr1=X and/or chr2=X chromosome scaffold prefixes are provided then Translocation will be restricted to matches between two different chromosome scaffolds. Matches including one or more non-chromosome scaffolds will be classed as Fragmentation.

Following gap fixes for inversions and translocations/breakpoints (below), following

After the initial mapping and GABLAM unique hit reduction, SynBad will further compress the Qry-Hit pairs by combining collinear hits between Qry-Hit pairs into larger blocks. Merged hits must be within maxsynspan=INT bp, overlap by no more than maxoverlap=INT bp, and have no intervening assembly gaps. An initial pass compares adjacent hits in position order. This is followed by iterative compressions where up to maxsynskip=INT intervening hits will be incorporated, providing there are no assembly gaps between the merged hits. This will start with the longest local hit, scanning downstream and then upstream for hits to merge, with increasing numbers of intervening hits up to maxsynskip=INT. Once no more merging is found for that hit, the next longest hit will be considered. If any hits are merged during this process, it will be repeated until convergence is reached.

During merging, Length and Identity statistics will be summed, and additional statistics will be added:

Non = The unaligned bases in the query between the GABLAM local hits.
Alt = The number of bases (Length) of hits to different hit scaffolds
Trans = The number of bases (Length) of hits to the same scaffold but outside the merged hits.
Inv = The number of bases (Length) of intermediate hits that are in the right place but inverted strand.
Dup = The number of overlapping bases in the query scaffolds between adjacent hits.

NOTE: This will be done for both genomes. For Genome 2, the “hits” are the Qry scaffolds, and all assessments of collinearity will be performed on the Hit scaffolds.

Next, SynBad will assess inversions and apparent translocations for flanking collinear hits and instigate a number of updated ratings or suggested fixes. The following new SynBad gap ratings may be added:

Div = Divergence = Translocation or Breakpoint gap with flanking collinear hits.
InvBrk = Inversion Breakpoint = Reclassified Inversion that appears to be out of place.
InvFix = Fixed Inversion = Inversion that becomes collinear when inverted. (See *.corrections.tdt)
InvDupFix = Fixed Inversion Duplication = Inversion that becomes a Duplication when inverted. (See *.corrections.tdt)
Long = Long Syntenic = Gap flanked by collinear Qry/Hit pairs but distance is greater than maxsynspan=INT (default 25 kb).

If fragment=T, the assemblies will then be fragmented on gaps that are not Syntenic, unless more than minreadspan=INT reads span the gap.

A future release of Synbad will optionally re-arrange the two assemblies, incorporating gapass assemblies where possible. GapSpanner can also be used to gap-fill spanned gaps prior to running SynBad.

1.1 Dependencies

SynBad needs Minimap2 installed. For gapass gap mode, Flye also needs to be installed. For KAT kmer assessments of flanks, KAT must be installed (or pre-run on the genomes). To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

For full documentation of the SynBad workflow, run with dochtml=T and read the *.docs.html file generated.

1.2 Commandline options

### ~ Main SynBad run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
genome1=FILE    : Genome assembly used as the query in the GABLAM searches []
genome2=FILE    : Genome assembly used as the searchdb in the GABLAM searches []
basefile=X      : Prefix for output files [synbad]
gablam=X        : Optional prefix for GABLAM search [defaults to $BASEFILE.map]
gapmode=X       : Diploidocus gap run mode (gapspan/gapass) [gapspan]
minloclen=INT   : Minimum length for aligned chunk to be kept (local hit length in bp) [1000]
minlocid=PERC   : Minimum percentage identity for aligned chunk to be kept (local %identity) [50]
maxsynskip=INT  : Maximum number of local alignments to skip for SynTrans classification [4]
maxsynspan=INT  : Maximum distance (bp) between syntenic local alignments to count as syntenic [25000]
synreadspan=INT : Minimum number of reads spanning a gap to change the rating to "Spanned" [5]
checkflanks=LIST: List of lengths flanking gaps that must also be spanned by reads [0,100,1000]
spannedflank=INT: Required flanking distance for synreadspan "Spanned" rating [0]
maxoverlap=INT  : Maximum overlap (bp) of adjacent local hits to allow compression [500]
chr1=X          : PAFScaff-style chromosome prefix for Genome 1 to distinguish Translocation from Fragmentation []
chr2=X          : PAFScaff-style chromosome prefix for Genome 2 to distinguish Translocation from Fragmentation []
### ~ Correction and Fragmentation options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
correct=LIST    : List of edit types to try to fix in the assembly (invert/extract/relocate/break/join; T/True=all) [True]
fragment=T/F    : Whether to fragment the assembly at gaps marked as non-syntenic if no corrections made [False]
fragtypes=LIST  : List of SynBad ratings to trigger fragmentation [Brk,Inv,InvBrk,Frag,Tran]
minreadspan=INT : Min number of Span0 reads in gaps table to prevent fragmentation [1]
minctglen=INT   : Extract any contigs below a minimum length threshold [500]
minbadctg=INT   : Extract any contigs with bad flanking ratings below a minimum length threshold [5000]
minscafflen=INT : Remove any scaffolds (inc. detached/extracted contigs) below minimum length threshold [500]
gapsize=INT     : Size of gaps to add when relocating assembling chunks [500]
rejoin=T/F      : Whether to rejoin original gaps that end up split into termini but not too short [True]
### ~ Additional input options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
masked1=FASFILE : Optional masked fasta file for assembly comparison [$BASEFILE1.masked.fasta]
bam1=FILE       : Optional BAM file of long reads mapped onto assembly 1 [$BASEFILE1.bam]
paf1=FILE       : Optional PAF file of long reads mapped onto assembly 1 [$BASEFILE1.paf]
reads1=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype1=LIST  : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco1=FILE     : Optional BUSCO full results file for genome 1 []
scdep1=NUM      : Optional single copy read depth for genome 1 []
masked2=FASFILE : Optional masked fasta file for assembly comparison [$BASEFILE2.masked.fasta]
bam2=FILE       : Optional BAM file of long reads mapped onto assembly 2 [$BASEFILE2.bam]
paf2=FILE       : Optional PAF file of long reads mapped onto assembly 2 [$BASEFILE2.paf]
reads2=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype2=LIST  : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco2=FILE     : Optional BUSCO full results file for genome 2 []
scdep2=NUM      : Optional single copy read depth for genome 2 []
mapflanks1=FILE : Flanks fasta file from previous SynBad run for mapping genome 1 flank identifiers []
mapflanks2=FILE : Flanks fasta file from previous SynBad run for mapping genome 2 flank identifiers []
fullmap=T/F     : Whether to abort if not all flanks can be mapped [True]
### ~ HiC Gap Flank options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
hicbam1=FILE    : Optional BAM file of HiC reads mapped onto assembly 1 [$BASEFILE1.HiC.bam]
hicbam2=FILE    : Optional BAM file of HiC reads mapped onto assembly 2 [$BASEFILE2.HiC.bam]
gapflanks=INT   : Size of gap flank regions to output for HiC pairing analysis (0=off) [10000]
pureflanks=T/F  : Whether to restrict gap flanks to pure contig sequence (True) or include good gaps (False) [True]
hicscore=X      : HiC scoring mode (pairs/score/wtscore) [wtscore]
hicmode=X       : Pairwise HiC assessment scoring strategy (synbad/pure/rand/full) [synbad]
hicmin=X        : Min. number of HiC read pairs for a "best" HiC pairing ruling [3]
hicdir1=PATH    : Path to HiC read ID lists for genome 1 [$BASEFILE.qryflanks/]
hicdir2=PATH    : Path to HiC read ID lists for genome 1 [$BASEFILE.hitflanks/]
### ~ Additional output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
newacc1=X       : Scaffold name prefix for updated Genome 1 output [None]
newacc2=X       : Scaffold name prefix for updated Genome 2 output [None]
bestpair=T/F    : Whether to restrict the paired output to the top scaffold pairs [False]
update=T/F      : Whether to reload compressed qry and hit tables but re-run additional compression [False]
force=T/F       : Whether to force regeneration of SynBad results tables [False]
dochtml=T/F     : Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

1.3 SynBad workflow

Details of the SynBad workflow will be added in a future release.

NOTE: SynBad is still under development. Features and details will continue to be updated.

SynBad V0.12.2 Documentation

Richard J. Edwards

2024-09-19

1 SynBad: Synteny-based scaffolding adjustment

1.1 Dependencies

1.2 Commandline options

1.3 SynBad workflow