Bug with the forbidden symbols when using unpackdb #466

@cutecutecat

Expected Behavior

The mmseqs unpackdb command should write out the database files from a cluster result successfully.

Current Behavior

The mmseqs unpackdb command stops when a database entry name contains '/'.
The symbols < > : " / | ? * are not allowed in file paths on Windows (all of them) or on Linux (only '/'). Because unpackdb creates a separate file for each entry and names it after the sequence, these symbols should be substituted with other characters.

Steps to Reproduce (for bugs)

mmseqs cluster IPR/gfp IPR/cluster IPR/tmp
mmseqs createseqfiledb IPR/gfp IPR/cluster IPR/cluster_seq 
mmseqs result2flat IPR/gfp IPR/gfp IPR/cluster_seq clu_seq.fasta
mmseqs createdb clu_seq.fasta  clu_result
mkdir IPR/cluster_out
trash-put IPR/cluster_out/*
mmseqs unpackdb clu_result IPR/cluster_out/

MMseqs Output (for bugs)

(xxx) yyy@zzz:~/aaa/shell/predo$ mmseqs unpackdb clu_result IPR/cluster_out/
unpackdb clu_result IPR/cluster_out/ 

MMseqs Version: 15ace29a276be54fee6b9aedd7a1e814a3c7769b
Verbosity       3

Could not open IPR/cluster_out/Q04901|unreviewed|Entactin/nidogen|taxID:7729 for writing!

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 15ace29
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.):
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: GNU Make 4.1/cmake version 3.10.2
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Intel(R) Xeon(R) Gold 6230R CPU 256GB
  • Operating system and version: Ubuntu 18.04.1 LTS

Solution

In src/util/unpack.cpp, substitute these forbidden symbols with other characters.
Because '|' appears frequently in sequence names, for example:
A0A348AT68|unreviewed|Fluorescent
we change '|' to '!':
A0A348AT68!unreviewed!Fluorescent
All other forbidden symbols are changed to '@':
W5UC41|unreviewed|Nidogen-1|taxID/7998 -> W5UC41!unreviewed!Nidogen-1!taxID@7998

With this change, unpackdb completes successfully:

(xxx) yyy@zzz:~/aaa/shell/predo$ mmseqs unpackdb clu_result IPR/cluster_out/
unpackdb clu_result IPR/cluster_out/ 

MMseqs Version: 15ace29a276be54fee6b9aedd7a1e814a3c7769b
Verbosity       3

[=================================================================] 100.00% 3.19K 0s 81ms     
Time for processing: 0h 0m 0s 90ms
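
For reference, the substitution described in the Solution section could look roughly like the sketch below. This is only an illustration, not the code from the pull request: the function name sanitizeFileName is hypothetical, and the set of forbidden symbols is taken from the list given in this issue.

```cpp
#include <string>

// Hypothetical helper (not from the MMseqs2 source): returns a copy of `name`
// that can be used as a single file name on Windows and Linux.
std::string sanitizeFileName(const std::string &name) {
    // Forbidden symbols as listed in this issue; '|' is handled separately.
    const std::string forbidden = "<>:\"/?*";
    std::string out = name;
    for (char &c : out) {
        if (c == '|') {
            // '|' is very common in sequence names, so it gets its own replacement.
            c = '!';
        } else if (forbidden.find(c) != std::string::npos) {
            // Every other forbidden symbol is mapped to '@'.
            c = '@';
        }
    }
    return out;
}
```

Calling sanitizeFileName("W5UC41|unreviewed|Nidogen-1|taxID/7998") gives W5UC41!unreviewed!Nidogen-1!taxID@7998, which matches the example above. Note that this mapping is not reversible and two different entry names could end up with the same file name, so a real fix may need to handle such collisions.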

I have already fixed this and will open a pull request soon, so there is no need to worry about it.
PR: #467
