abyss-pe 2.1.0 segfault with Open MPI 3.1.0 #236

@jdmontenegro

Please report

System

Hi all, I am using ABySS 2.1.0 compiled against Open MPI 3.1.0, Boost 1.66 and sparsehash 2.0.3 on a CentOS 7 cluster with 1.5 TB of RAM and 128 threads available.

Assembly error

My abyss command line is the following:
abyss-pe name=NewAssembly G=3000000000 s=500 v=-v np=64 k=97 in="reads1.fastq reads2.fastq"
After running for nine and a half hours, I get this error:

[balder-wn05:31600] *** Process received signal ***
[balder-wn05:31600] Signal: Segmentation fault (11)
[balder-wn05:31600] Signal code: Invalid permissions (2)
[balder-wn05:31600] Failing at address: 0x7f618bee27d8
[balder-wn05:31600] [ 0] /usr/lib64/libc.so.6(+0x35270)[0x7f618c2ee270]
[balder-wn05:31600] [ 1] /usr/local/appl/software/openmpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x429c)[0x7f6180b9829c]
[balder-wn05:31600] [ 2] /usr/local/appl/software/openmpi/3.1.0/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f618bb1324c]
[balder-wn05:31600] [ 3] /usr/local/appl/software/openmpi/3.1.0/lib/libmpi.so.40(PMPI_Request_get_status+0x74)[0x7f618cf8e154]
[balder-wn05:31600] [ 4] ABYSS-P[0x40dcec]
[balder-wn05:31600] [ 5] ABYSS-P[0x40df34]
[balder-wn05:31600] [ 6] ABYSS-P[0x40f414]
[balder-wn05:31600] [ 7] ABYSS-P[0x4148c8]
[balder-wn05:31600] [ 8] ABYSS-P[0x4169d2]
[balder-wn05:31600] [ 9] ABYSS-P[0x40600a]
[balder-wn05:31600] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f618c2dac05]
[balder-wn05:31600] [11] ABYSS-P[0x40766f]
[balder-wn05:31600] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 46 with PID 31600 on node balder-wn05 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [/data/Bioinfo/bioinfo-proj-jmontenegro/DENOVO/Dunnart/Results/Assembly/Abyss/dunnart_abyss-1.fa] Error 139

The total number of bases sequenced was 160 Gbp for a 3 Gbp diploid genome (~50X sequencing depth).
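As a quick sanity check on the depth figure above, here is a minimal sketch of the arithmetic. The function name `sequencing_depth` is made up for illustration; the two inputs are the numbers quoted in this report (160 Gbp sequenced, and the 3 Gbp genome size passed as `G=3000000000`).

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


# Figures taken from the report: 160 Gbp of reads, 3 Gbp genome.
depth = sequencing_depth(160e9, 3e9)
print(f"Estimated depth: {depth:.1f}X")
```

This gives a little over 53X, which agrees with the rounded ~50X stated above.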

I am using the Slurm scheduler and requesting 1 TB of memory and 64 CPUs (64 tasks, 1 CPU per task) for this assembly. I can see that each process is using around 8.5 GB, so 64 * 8.5 = 544 GB in total. That is roughly half the memory allocated for this job, so it does not look like the job is running out of memory. The system administrator is looking into the details of the failure, but so far I cannot find a way around it. I have tried reducing the number of processes (np) to 32 and 16, and the error is the same.
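The memory estimate above can be written out as a small sketch. The helper name `total_memory_gb` is hypothetical, the per-process figure is the approximate value observed on this run, and 1 TB is assumed to mean 1000 GB here.

```python
def total_memory_gb(per_process_gb: float, n_processes: int) -> float:
    """Aggregate memory use, assuming every process uses about the same amount."""
    return per_process_gb * n_processes


# Observed on this run: about 8.5 GB per process across 64 MPI processes (np=64).
used_gb = total_memory_gb(8.5, 64)

# Assumption: the 1 TB Slurm allocation is taken as 1000 GB.
allocated_gb = 1000.0
fraction_used = used_gb / allocated_gb

print(f"Used: {used_gb:.0f} GB of {allocated_gb:.0f} GB ({fraction_used:.0%})")
```

Under these assumptions the job sits at a bit more than half of its allocation, which is why memory exhaustion seems an unlikely cause of the segfault.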

Any help would be much appreciated.

Kind regards,
