A single-core implementation of frequency-based substring mining. This
implementation requires the https://github.com/simongog/sdsl-lite
library (tested using the release sdsl-lite-2.0.3).
- Download and extract https://github.com/simongog/sdsl-lite/archive/v2.0.3.tar.gz.
- Install SDSL by running ./install.sh /install/path/sdsl-lite-2.0.3, where /install/path must be replaced with your chosen installation directory.
- Set the correct SDSL installation path in fsm-lite/Makefile.
- Turn on your preferred compiler optimizations in fsm-lite/Makefile.
- Run make depend && make in the fsm-lite directory.
For command-line options, see ./fsm-lite --help.
Input files are given as a list of <data-identifier> <data-filename> pairs, one pair per line. Each <data-identifier> is assumed to be unique. Here is an example of how to construct such a list from all /input/dir/*.fasta files:
for f in /input/dir/*.fasta; do id=$(basename "$f" .fasta); echo "$id $f"; done > input.list
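Because the identifiers are assumed to be unique, it can be worth checking the list before processing it. The following is a minimal sketch and not part of fsm-lite: duplicate_ids is a hypothetical helper name, and the sketch assumes that identifiers contain no whitespace, so the first whitespace-separated field of each line is the identifier.

```shell
# Hypothetical helper (not part of fsm-lite): print every identifier that
# appears more than once in the first column of a list file.
duplicate_ids() {
    awk '{ print $1 }' "$1" | sort | uniq -d
}

# Example: warn if input.list exists and contains repeated identifiers.
if [ -f input.list ] && [ -n "$(duplicate_ids input.list)" ]; then
    echo "duplicate identifiers in input.list:" >&2
    duplicate_ids input.list >&2
fi
```

If the helper prints nothing, every identifier in the list is distinct.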
The files can then be processed by
./fsm-lite -l input.list -t tmp | gzip - > output.txt.gz
where tmp is a prefix filename for storing temporary index files.
Planned improvements:
- Optimize the time and space usage.
- Add multi-threading.
- Support gzip-compressed input.