Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

SegFault while testing MXNet binaries for CUDA-11.0 using pytest #19360

@mseth10

Description

@mseth10

Description

Nightly CD pipeline fails for CUDA 11.0 during testing of MXNet binaries using pytest. All tests run successfully. The error is thrown during cleanup after pytest is done running a testing module. This error was first recorded when 480d027 commit was merged, which dropped pytest's teardown function. Before this commit, the CD pipeline was running successfully for all flavors.

This error is specific to CUDA 11.0 and is not observed for CUDA 10.0 and 10.1 as can be seen here:
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1848/pipeline/361/

Error Message

Stack trace:
Stack trace:
  /usr/lib64/libcudnn_ops_infer.so.8 (                                           + 0x15cb68f)  [0x7f7f4ce3e68f]
  /usr/lib64/libcudnn_ops_infer.so.8 ( cudnnDestroy                              + 0x6f  )  [0x7f7f4ba78ddf]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( mshadow::Stream<mshadow::gpu>::DestroyDnnHandle()  + 0x2c  )  [0x7f81869a29ec]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)  + 0x13b )  [0x7f81869a2c3b]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)  + 0x1bb )  [0x7f81869b83ab]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)  + 0x36  )  [0x7f81869b86f6]
  /work/mxnet/python/mxnet/../../lib/libmxnet.so ( std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()  + 0x32  )  [0x7f81869b3db2]
/work/runtime_functions.sh: line 747:     6 Segmentation fault      (core dumped) pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
2020-10-16 07:44:31,682 - root - INFO - Waiting for status of container a8b282e29adf for 600 s.
2020-10-16 07:44:31,853 - root - INFO - Container exit status: {'Error': None, 'StatusCode': 139}
2020-10-16 07:44:31,854 - root - ERROR - Container exited with an error 😞
2020-10-16 07:44:31,854 - root - INFO - Executed command for reproduction:

ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110

Steps to reproduce

I was able to reproduce the error by following these steps on an AWS Ubuntu18 Deep Learning Base AMI:

alias python=python3

git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
pip3 install -r ci/requirements.txt --user

sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

python ci/build.py -e BRANCH=null --docker-registry mxnetci --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_static_libmxnet cu110

python ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu110

What have you tried to solve it?

  1. The above script takes a long time to run as it runs a lot of tests. I reduced the reproduction time by reducing the number of tests. Here's a code diff:
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index 40405b961..6992caa36 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -756,7 +756,9 @@ cd_unittest_ubuntu() {
     export DMLC_LOG_STACK_TRACE_DEPTH=10
 
     local mxnet_variant=${1:?"This function requires a mxnet variant as the first argument"}
+    pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
 
+    : '
     OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest
     pytest -m 'serial' --durations=50 --verbose tests/python/unittest
 
@@ -782,6 +784,7 @@ cd_unittest_ubuntu() {
     if [[ ${mxnet_variant} = *mkl ]]; then
         OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -n 4 --durations=50 --verbose tests/python/mkl
     fi
+    '
 }
  1. I put a print statement before the waitall command to check whether it gets executed and observed that it gets executed after the module ends as expected.

  2. I tried replacing mx.npx.waitall() with mx.nd.waitall(), but that doesn't solve this problem.

Environment

We recommend using our script for collecting the diagnostic information with the following command
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3

Environment Information
----------Python Info----------
Version      : 3.6.9
Compiler     : GCC 8.4.0
Build        : ('default', 'Oct  8 2020 12:12:24')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.2.3
Directory    : /usr/local/lib/python3.6/dist-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-5.4.0-1028-aws-x86_64-with-Ubuntu-18.04-bionic
system       : Linux
node         : ip-172-31-5-167
release      : 5.4.0-1028-aws
version      : #29~18.04.1-Ubuntu SMP Tue Oct 6 17:14:23 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3109.947
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0088 sec, LOAD: 0.6286 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1193 sec, LOAD: 0.1101 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.055264949798583984 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0012 sec, LOAD: 0.1100 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0014 sec, LOAD: 0.3008 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.0010542869567871094 sec.
----------Environment----------
@leezu @TristonC

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions