MXNet 2.x significantly slower than 1.x in Sockeye #20636

## Description
We observe a significant reduction in [Sockeye](https://github.com/awslabs/sockeye) inference speed with a recent build of MXNet 2.x (master branch). Compared to 1.x versions of MXNet, GPU translation with MXNet 2.x is **~1.5x to ~2.8x slower**, depending on batch size and instance type.

For MXNet 2.x, we migrated Sockeye to the Gluon 2.0 interface and adopted the new NumPy namespaces. Otherwise, the code is equivalent to master, with the same level of hybridization (`static_alloc=True`) in both branches. The pull request/branch can be found here: https://github.com/awslabs/sockeye/pull/953.

The runs below use half precision (FP16) on p3.2xlarge and g4 instances. The translation outputs are identical across MXNet versions.


### p3.2xlarge instance
#### batch size 64
`mxnet-cu112 2.0.0b20211001`:
```
[INFO:__main__] Processed 3003 lines. Total time: 37.2888, sec/sent: 0.0124, sent/sec: 80.5336
```
`mxnet-cu112 1.7`:
```
[INFO:__main__] Processed 3003 lines. Total time: 20.2805, sec/sent: 0.0068, sent/sec: 148.0735
```

#### batch size 1
`mxnet-cu112 2.0.0b20211001`:
```
[INFO:__main__] Processed 3003 lines. Total time: 858.3818, sec/sent: 0.2858, sent/sec: 3.4984
```
`mxnet-cu112 1.7`:
```
[INFO:__main__] Processed 3003 lines. Total time: 302.0189, sec/sent: 0.1006, sent/sec: 9.9431
```

### g4 instance
Log file prefixes: `mx18` = MXNet 1.8, `mx20` = MXNet 2.x; `out.1`/`out.64` = batch size 1/64.
```
mx18/out.1.bpe.log:[2021-10-04:20:02:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 316.4692, sec/sent: 0.1054, sent/sec: 9.4891
mx18/out.64.bpe.log:[2021-10-04:20:03:10:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 31.8175, sec/sent: 0.0106, sent/sec: 94.3819
mx20/out.1.bpe.log:[2021-10-04:20:17:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 714.5509, sec/sent: 0.2379, sent/sec: 4.2026
mx20/out.64.bpe.log:[2021-10-04:20:18:26:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 46.4607, sec/sent: 0.0155, sent/sec: 64.6352
```
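To make the comparison explicit, the slowdown factors can be recomputed from the summary lines above. The following is a minimal sketch, not part of Sockeye: the helper names `parse_summary` and `slowdown` are hypothetical, the log lines are shortened copies of the ones in this report, and the g4 baseline is assumed to be the MXNet 1.8 run (`mx18`).

```python
import re

# Pattern for the summary line Sockeye prints at the end of a translation run.
SUMMARY = re.compile(r"Processed (\d+) lines\. Total time: ([\d.]+)")


def parse_summary(line):
    """Return (num_lines, total_seconds) from one summary log line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"no summary found in {line!r}")
    return int(match.group(1)), float(match.group(2))


def slowdown(mx2_line, mx1_line):
    """Ratio of MXNet 2.x total time to MXNet 1.x total time."""
    _, mx2_seconds = parse_summary(mx2_line)
    _, mx1_seconds = parse_summary(mx1_line)
    return mx2_seconds / mx1_seconds


# (MXNet 2.x line, MXNet 1.x line), shortened copies of the logs in this report.
RUNS = {
    "p3.2xlarge, batch 64": (
        "Processed 3003 lines. Total time: 37.2888, sec/sent: 0.0124",
        "Processed 3003 lines. Total time: 20.2805, sec/sent: 0.0068",
    ),
    "p3.2xlarge, batch 1": (
        "Processed 3003 lines. Total time: 858.3818, sec/sent: 0.2858",
        "Processed 3003 lines. Total time: 302.0189, sec/sent: 0.1006",
    ),
    "g4, batch 64": (
        "Processed 3003 lines. Total time: 46.4607, sec/sent: 0.0155",
        "Processed 3003 lines. Total time: 31.8175, sec/sent: 0.0106",
    ),
    "g4, batch 1": (
        "Processed 3003 lines. Total time: 714.5509, sec/sent: 0.2379",
        "Processed 3003 lines. Total time: 316.4692, sec/sent: 0.1054",
    ),
}

RATIOS = {name: slowdown(*lines) for name, lines in RUNS.items()}

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio:.2f}x slower")
```

With the numbers above this gives 1.84x and 2.84x on p3.2xlarge (batch 64 and 1) and 1.46x and 2.26x on g4 (batch 64 and 1), so the regression is most pronounced at batch size 1.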

## To Reproduce
- Download the Sockeye sample model
- Run `translate.sh` with the `master` branch of Sockeye
- Run `translate.sh` with the `mx2` branch of Sockeye

### Steps to reproduce
1. `wget https://github.com/awslabs/sockeye/releases/download/2.3.22/wmt14_en_de.tgz`
2. `tar -xvf wmt14_en_de.tgz`
3. `git clone https://github.com/awslabs/sockeye.git`
4. `pip install -r sockeye/requirements/requirements.gpu-cu112.txt`
5. `mv sockeye/sockeye wmt_14_en_de`
6. `cd wmt_14_en_de`
7. `bash translate.sh` [translate with master branch]
8. `git checkout mx2`
9. (Install nightly build of mx2: `pip uninstall mxnet-cu112 ; pip install --pre -f https://dist.mxnet.io/python 'mxnet-cu112'`)
10. `bash translate.sh` [translate with mx2 branch]
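
Because the steps above swap the MXNet build between the two runs, it can help to confirm which build is active before each `translate.sh` call. A minimal sketch using only the Python standard library (Python 3.8+); the helper names `installed_version` and `is_mxnet2` are hypothetical, and the distribution name `mxnet-cu112` is the one used in this report:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_mxnet2(version):
    """True if the version string belongs to the MXNet 2.x series."""
    return version is not None and version.split(".")[0] == "2"


if __name__ == "__main__":
    version = installed_version("mxnet-cu112")
    print(f"mxnet-cu112: {version} (2.x build: {is_mxnet2(version)})")
```

Running this before step 7 and again before step 10 should report a 1.x version first and a `2.0.0b...` nightly second, assuming the reinstall in step 9 succeeded.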

## What have you tried to solve it?
-

## Environment
- CUDA 11.2 (`conda install -c conda-forge nccl cudnn cudatoolkit==11.2`)
- MXNet 1.8.post0 or MXNet 1.7 vs MXNet 2.x (`2.0.0b20211001`)
