This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Race-condition and crash with SymbolBlock on GPU #18765

@chinakook

Description

Severe bug with nn.SymbolBlock: running the block with ctx=mx.gpu(0) crashes, while the same code runs fine on CPU.

Error Message

A malloc error, a free error, or a segmentation fault appears at random. Three separate runs of the same script:

/home/xxxxxx/anaconda3/envs/solo/lib/python3.7/site-packages/mxnet/gluon/block.py:1517: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
        data: None
  input_sym_arg_type = in_param.infer_type()[0]
[17:15:59] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[(1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)]
malloc(): unsorted double linked list corrupted
[1]    87116 abort (core dumped)  python symbolblockbug.py

/home/xxxxxx/anaconda3/envs/solo/lib/python3.7/site-packages/mxnet/gluon/block.py:1517: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
        data: None
  input_sym_arg_type = in_param.infer_type()[0]
[17:21:29] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[(1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)]

Segmentation fault: 11

/home/xxxxxx/anaconda3/envs/solo/lib/python3.7/site-packages/mxnet/gluon/block.py:1517: UserWarning: Cannot decide type for the following arguments. Consider providing them as input:
        data: None
  input_sym_arg_type = in_param.infer_type()[0]
[17:23:24] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[(1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)]
malloc_consolidate(): invalid chunk size
[1]    87701 abort (core dumped)  python symbolblockbug.py

To Reproduce

import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn
import gluoncv as gcv


class NetEncoder(nn.SymbolBlock):
    def __init__(self, **kwargs):
        base_network = gcv.model_zoo.resnet50_v1(pretrained=False)
        # Intermediate feature maps to expose, one per ResNet stage.
        outputs = ['stage1_activation2', 'stage2_activation3', 'stage3_activation5',
                   'stage4_activation2']

        # _parse_network is a private gluoncv helper. Note that the
        # parameters are created on the CPU here.
        inputs, outputs, params = gcv.nn.feature._parse_network(
            base_network, outputs, ['data'], pretrained=False, ctx=mx.cpu(), **kwargs)
        super(NetEncoder, self).__init__(outputs, inputs, params=params)


class Foo(nn.HybridBlock):
    def __init__(self):
        super(Foo, self).__init__()
        # SymbolBlock nested inside a HybridBlock.
        self.features = NetEncoder()

    def hybrid_forward(self, F, x):
        y = self.features(x)
        return y


# The input lives on the GPU.
a = mx.nd.random.uniform(shape=(1,3,224,224), ctx=mx.gpu(0))

f = Foo()
f.collect_params().initialize()
f.hybridize()
# Move the CPU-initialized parameters to the GPU, then run forward.
f.collect_params().reset_ctx(mx.gpu(0))
b = f(a)
print([x.shape for x in b])
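
Because the failure is intermittent, it may help to run the script many times and tally how each run ends. The sketch below is not part of the original report: the helper names `describe_exit`, `run_repeatedly`, and `compare_engines` are hypothetical, and it assumes the reproduction above is saved as `symbolblockbug.py` (the name that appears in the logs). It compares the default engine with `MXNET_ENGINE_TYPE=NaiveEngine`, MXNet's synchronous engine. If the crashes stop under NaiveEngine, that would be consistent with a race condition, though it would not prove one.

```python
import os
import signal
import subprocess
import sys
from collections import Counter


def describe_exit(returncode):
    """Turn a subprocess return code into a readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal
        # (e.g. SIGSEGV for a segfault, SIGABRT for a glibc malloc abort).
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return "signal %d" % -returncode
    return "exit %d" % returncode


def run_repeatedly(cmd, runs=10, env_overrides=None):
    """Run cmd `runs` times and count how each run ended."""
    env = dict(os.environ)
    env.update(env_overrides or {})
    outcomes = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd, env=env,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        outcomes[describe_exit(proc.returncode)] += 1
    return outcomes


def compare_engines(script="symbolblockbug.py", runs=10):
    """Tally outcomes with the default engine and with NaiveEngine."""
    cmd = [sys.executable, script]
    default = run_repeatedly(cmd, runs)
    # Assumption: serializing execution hides the crash if it is a race.
    naive = run_repeatedly(
        cmd, runs, env_overrides={"MXNET_ENGINE_TYPE": "NaiveEngine"})
    return {"default": dict(default), "NaiveEngine": dict(naive)}
```

Calling `print(compare_engines())` from the directory that holds the script prints one tally per engine. The same `env_overrides` argument can also set `MXNET_CUDNN_AUTOTUNE_DEFAULT` to `0`, the variable named in the log output, to check whether cuDNN autotuning is involved.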

Environment

  1. mxnet_cu102-1.7.0b20200719-py2.py3-none-manylinux2014_x86_64
  2. mxnet 2.0 master in April
