This repository was archived by the owner on Nov 17, 2023. It is now read-only.

IndexError in gluon-cv Mask-RCNN validation on master #17485

@Kh4L

Description

An IndexError occurs during the first validation step when training gluon-cv Mask-RCNN with horovod.

Error Message

[1,2]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
[1,5]<stderr>:Traceback (most recent call last):
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 615, in <module>
[1,5]<stderr>:    train(net, train_data, val_data, eval_metric, batch_size, ctx, logger, args)
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 540, in train
[1,5]<stderr>:    args)
[1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 275, in validate
[1,5]<stderr>:    det_bbox = det_bbox[i].asnumpy()
[1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2554, in asnumpy
[1,5]<stderr>:    ctypes.c_size_t(data.size)))
[1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/base.py", line 273, in check_call
[1,5]<stderr>:    raise get_last_ffi_error()
[1,5]<stderr>:IndexError: Traceback (most recent call last):
[1,5]<stderr>:  File "src/operator/tensor/indexing_op.cu", line 461
[1,5]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[39172,1],0]
  Exit code:    1
--------------------------------------------------------------------------
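A note for reading the log: the failing index (999) is larger than the axis it is applied to (size 500). MXNet executes operators asynchronously, so the error is reported at the next synchronization point, here the `asnumpy()` call in `validate`, not at the line that issued the indexing operation. Where the value 999 comes from is not established in this report. The sketch below is plain Python, not gluon-cv or MXNet code; `find_out_of_range` and `checked_gather` are hypothetical helper names that only illustrate the kind of bounds check that fails, and how one might locate the offending indices before gathering.

```python
def find_out_of_range(indices, axis_size):
    """Return (position, value) pairs for indices that are invalid
    for an axis of length axis_size (negative indices wrap, as in NumPy)."""
    return [(pos, idx) for pos, idx in enumerate(indices)
            if not -axis_size <= idx < axis_size]


def checked_gather(row, indices):
    """Gather row[idx] for each idx, failing early with a NumPy-style message.

    The "axis 1" wording is hard-coded to mirror the log above; this is an
    illustration only, not the check performed in indexing_op.cu.
    """
    bad = find_out_of_range(indices, len(row))
    if bad:
        _, idx = bad[0]
        raise IndexError(
            f"index {idx} is out of bounds for axis 1 with size {len(row)}")
    return [row[idx] for idx in indices]


if __name__ == "__main__":
    scores = [0.0] * 500          # stand-in for an axis of size 500
    try:
        checked_gather(scores, [0, 499, 999])
    except IndexError as err:
        print(err)
```

Running an equivalent check on the index tensor just before the indexing operation (after copying it to the host) would show which positions hold out-of-range values, at the cost of an extra synchronization.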

Steps to reproduce

  1. Compile and install mxnet master
  2. Get gluon-cv master
  3. horovodrun -np 8 -H localhost:8 python ./gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py --dataset coco -j 4 --log-interval 1000 --use-fpn --horovod --amp --batch-size 16 --lr 0.02 --lr-warmup 500 --epochs 1

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
