abyss-pe 2.1.0 segfault with Open MPI 3.1.0 #236
Description
Please report
System
Hi all, I am using ABySS 2.1.0 compiled with openmpi/3.1.0, boost/1.66, and sparsehash/2.0.3 on a CentOS 7 cluster with 1.5 TB of RAM and 128 threads available.
Assembly error
My abyss command line is the following:
abyss-pe name=NewAssembly G=3000000000 s=500 v=-v np=64 k=97 in="reads1.fastq reads2.fastq"
After running for nine and a half hours, I get this error:
[balder-wn05:31600] *** Process received signal ***
[balder-wn05:31600] Signal: Segmentation fault (11)
[balder-wn05:31600] Signal code: Invalid permissions (2)
[balder-wn05:31600] Failing at address: 0x7f618bee27d8
[balder-wn05:31600] [ 0] /usr/lib64/libc.so.6(+0x35270)[0x7f618c2ee270]
[balder-wn05:31600] [ 1] /usr/local/appl/software/openmpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x429c)[0x7f6180b9829c]
[balder-wn05:31600] [ 2] /usr/local/appl/software/openmpi/3.1.0/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f618bb1324c]
[balder-wn05:31600] [ 3] /usr/local/appl/software/openmpi/3.1.0/lib/libmpi.so.40(PMPI_Request_get_status+0x74)[0x7f618cf8e154]
[balder-wn05:31600] [ 4] ABYSS-P[0x40dcec]
[balder-wn05:31600] [ 5] ABYSS-P[0x40df34]
[balder-wn05:31600] [ 6] ABYSS-P[0x40f414]
[balder-wn05:31600] [ 7] ABYSS-P[0x4148c8]
[balder-wn05:31600] [ 8] ABYSS-P[0x4169d2]
[balder-wn05:31600] [ 9] ABYSS-P[0x40600a]
[balder-wn05:31600] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f618c2dac05]
[balder-wn05:31600] [11] ABYSS-P[0x40766f]
[balder-wn05:31600] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 46 with PID 31600 on node balder-wn05 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [/data/Bioinfo/bioinfo-proj-jmontenegro/DENOVO/Dunnart/Results/Assembly/Abyss/dunnart_abyss-1.fa] Error 139
The total number of bases sequenced was 160 Gbp for a 3 Gbp diploid genome (~50X sequencing depth).
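As a quick sanity check on the depth figure, here is a minimal Python sketch of the arithmetic. The function name is hypothetical and chosen only for illustration; the inputs are the 160 Gbp and 3 Gbp values stated above.

```python
# Illustrative sketch: approximate sequencing depth from the figures above.
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Average number of times each base of the genome is covered."""
    return total_bases / genome_size

TOTAL_BASES = 160e9   # 160 Gbp sequenced
GENOME_SIZE = 3e9     # 3 Gbp genome, matching G=3000000000 in the command

depth = sequencing_depth(TOTAL_BASES, GENOME_SIZE)
print(f"approximate depth: {depth:.1f}X")
```

This gives a value slightly above 50X, consistent with the rounded figure quoted above.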
I am using the Slurm scheduler and requesting 1 TB of memory and 64 CPUs (64 tasks, 1 CPU per task) for this assembly. I can see that each process is using around 8.5 GB, so 64 * 8.5 = 544 GB, which is roughly half the memory allocated to the job. The system administrator is looking into the details of the failure, but so far I have not found a workaround. I have tried reducing the number of processes to 32 and 16, and the error is the same.
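The memory estimate above can be reproduced with a short Python sketch. The function name is hypothetical, the per-process figure is the approximate value reported above, and the conversion 1 TB = 1000 GB is an assumption; the scheduler may count memory in binary units instead.

```python
# Illustrative sketch: total memory used by the MPI job versus the allocation.
def total_memory_gb(n_processes: int, per_process_gb: float) -> float:
    """Total memory if every process uses roughly the same amount."""
    return n_processes * per_process_gb

N_PROCESSES = 64        # np=64 from the abyss-pe command line
PER_PROCESS_GB = 8.5    # approximate usage observed per process
ALLOCATED_GB = 1000.0   # 1 TB requested, ASSUMING 1 TB = 1000 GB

used_gb = total_memory_gb(N_PROCESSES, PER_PROCESS_GB)
fraction = used_gb / ALLOCATED_GB
print(f"{used_gb:.0f} GB used, {fraction:.0%} of the allocation")
```

Under these assumptions the job sits well below its memory limit, which suggests the segfault is not a simple out-of-memory kill, although this estimate alone cannot rule out a transient spike.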
Any help would be much appreciated.
Kind regards,