-
Notifications
You must be signed in to change notification settings - Fork 265
Description
I've bumped into what appears to be a bug when using easy-search with the --greedy-best-hits flag and --cov-mode 1 -c 1, where hits containing targets that fully cover the query are being filtered out. I've outlined the behaviour in the minimal repro below:
Using file seq.fasta with content (the sequence is arbitrary):
>random_seq
NTLGEQNARIWSEREVGICLARKHPFLDFSVVKDMIATGLGTDYKALSPQ
and running easy-search with the above fasta as both the query and target:
mmseqs easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits produces an empty result.tsv
Running the above command without the --greedy-best-hits flag results in the expected result.tsv:
random_seq random_seq 1.000 50 0 0 1 50 1 50 1.586E-38 108
My guess is that there is a bug in the summarizeresult command being run when --greedy-best-hits is passed in:
summarizeresult tmp/15248047069833069253/result tmp/15248047069833069253/result_best -a 1 --overlap 0 -c 1 --threads 8 --compressed 0 -v 3
But it's quite possible I'm missing something... thanks for your time!
Expected Behavior
A result.tsv with content
random_seq random_seq 1.000 50 0 0 1 50 1 50 1.586E-38 108
Current Behavior
An empty result.tsv
Steps to Reproduce (for bugs)
echo ">random_seq\nNTLGEQNARIWSEREVGICLARKHPFLDFSVVKDMIATGLGTDYKALSPQ" > seq.fasta
mmseqs easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits
MMseqs Output (for bugs)
Details
Create directory tmp
easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits
MMseqs Version: 13.45111
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 1
Coverage mode 1
Max sequence length 65535
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Threads 8
Compressed 0
Verbosity 3
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 5.7
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files true
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
Overlap threshold 0
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 0
Greedy best hits true
Alignment backtraces will be computed, since they were requested by output format.
createdb seq.fasta tmp/15248047069833069253/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Converting sequences
Time for merging to query_h: 0h 0m 0s 9ms
Time for merging to query: 0h 0m 0s 8ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 38ms
createdb seq.fasta tmp/15248047069833069253/target --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Converting sequences
Time for merging to target_h: 0h 0m 0s 9ms
Time for merging to target: 0h 0m 0s 8ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 40ms
Create directory tmp/15248047069833069253/search_tmp
search tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/result tmp/15248047069833069253/search_tmp -a 1 --alignment-mode 3 -c 1 --cov-mode 1 -s 5.7 --remove-tmp-files 1
prefilter tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 1 --cov-mode 1 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 8 --compressed 0 -v 3 -s 5.7
Query database size: 1 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 1 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 1 eta -
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 1 eta -
Index statistics
Entries: 41
DB size: 488 MB
Avg k-mer size: 0.000001
Top 10 k-mers
DFVKIA 1
ATLTKA 1
SEEGLA 1
IWEEIC 1
PFDSKD 1
CLRHLD 1
DMAGTD 1
QNRWRE 1
GENRSE 1
LAKPDF 1
Time for index table init: 0h 0m 0s 778ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 112
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 1
[=================================================================] 100.00% 1 eta -
602.620000 k-mers per position
41 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
1 sequences passed prefiltering per query sequence
1 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 740ms
align tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 tmp/15248047069833069253/result --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 1 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 1 --cov-mode 1 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 8 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 1 type: Aminoacid
Target database size: 1 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 1 eta -
Time for merging to result: 0h 0m 0s 0ms
1 alignments calculated
1 sequence pairs passed the thresholds (1.000000 of overall calculated)
1.000000 hits per query sequence
Time for processing: 0h 0m 0s 28ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 -v 3
Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/aln_0 -v 3
Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/input_0 -v 3
Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/aln_merge -v 3
Time for processing: 0h 0m 0s 1ms
summarizeresult tmp/15248047069833069253/result tmp/15248047069833069253/result_best -a 1 --overlap 0 -c 1 --threads 8 --compressed 0 -v 3
[=================================================================] 100.00% 1 eta -
Time for merging to result_best: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 17ms
convertalis tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/result_best result.tsv --sub-mat nucl:nucleotide.out,aa:blosum62.out --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 8 --compressed 0 -v 3
[=================================================================] 100.00% 1 eta -
Time for merging to result.tsv: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 11ms
rmdb tmp/15248047069833069253/result_best -v 3
Time for processing: 0h 0m 0s 2ms
rmdb tmp/15248047069833069253/result -v 3
Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/target -v 3
Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/target_h -v 3
Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/query -v 3
Time for processing: 0h 0m 0s 2ms
rmdb tmp/15248047069833069253/query_h -v 3
Context
See above.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
- Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): 13.45111
- For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
- Operating system and version: