This repository was archived by the owner on Nov 17, 2023. It is now read-only.
MXNet 2.x significantly slower than 1.x in Sockeye #20636
Closed as not planned
Description
We observe a significant reduction in Sockeye inference speed with a recent build of MXNet 2.x (master branch). Compared to MXNet 1.x, GPU translation with MXNet 2.x is roughly 2x slower (about 1.8x at batch size 64 and 2.8x at batch size 1 on a p3.2xlarge).
For MXNet 2.x, we migrated Sockeye to the Gluon 2.0 interface and adopted the new NumPy namespaces. Otherwise, the code is equivalent to master, with the same level of hybridization (static_alloc=True) in both branches. The pull request/branch can be found here: awslabs/sockeye#953.
The runs below use half precision (FP16) on a p3.2xlarge instance, with a second set of runs on a g4 instance. The translation outputs are identical across MXNet versions.
p3.2xlarge instance
batch size 64
mxnet-cu112 2.0.0b20211001:
[INFO:__main__] Processed 3003 lines. Total time: 37.2888, sec/sent: 0.0124, sent/sec: 80.5336
mxnet-cu112 1.7:
[INFO:__main__] Processed 3003 lines. Total time: 20.2805, sec/sent: 0.0068, sent/sec: 148.0735
batch size 1
mxnet-cu112 2.0.0b20211001:
[INFO:__main__] Processed 3003 lines. Total time: 858.3818, sec/sent: 0.2858, sent/sec: 3.4984
mxnet-cu112 1.7:
[INFO:__main__] Processed 3003 lines. Total time: 302.0189, sec/sent: 0.1006, sent/sec: 9.9431
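As a quick sanity check on the p3.2xlarge figures above: the per-sentence fields follow directly from the total time (sec/sent = total / lines, sent/sec = lines / total), and the slowdown of 2.x relative to 1.7 is simply the ratio of total times. A minimal Python sketch using only the numbers copied from those logs (the helper names are illustrative, not part of Sockeye):

```python
# Total decoding times in seconds for 3003 sentences on p3.2xlarge (FP16),
# copied from the log lines above.
LINES = 3003
TOTAL_TIME = {
    # (MXNet version, batch size): total time in seconds
    ("2.0.0b20211001", 64): 37.2888,
    ("1.7", 64): 20.2805,
    ("2.0.0b20211001", 1): 858.3818,
    ("1.7", 1): 302.0189,
}


def throughput(total_seconds: float, lines: int = LINES) -> float:
    """Sentences per second, matching the 'sent/sec' field of the log."""
    return lines / total_seconds


def seconds_per_sentence(total_seconds: float, lines: int = LINES) -> float:
    """Seconds per sentence, matching the 'sec/sent' field of the log."""
    return total_seconds / lines


def slowdown(batch_size: int) -> float:
    """Ratio of MXNet 2.x total time to MXNet 1.7 total time at one batch size."""
    return TOTAL_TIME[("2.0.0b20211001", batch_size)] / TOTAL_TIME[("1.7", batch_size)]


for batch_size in (64, 1):
    new = throughput(TOTAL_TIME[("2.0.0b20211001", batch_size)])
    old = throughput(TOTAL_TIME[("1.7", batch_size)])
    print(f"batch {batch_size}: 2.x {new:.2f} sent/sec vs 1.7 {old:.2f} sent/sec "
          f"-> {slowdown(batch_size):.2f}x slower")
```

This reports a slowdown of about 1.84x at batch size 64 and about 2.84x at batch size 1, so the gap widens as the batch shrinks.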
g4 instance
mx18/out.1.bpe.log:[2021-10-04:20:02:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 316.4692, sec/sent: 0.1054, sent/sec: 9.4891
mx18/out.64.bpe.log:[2021-10-04:20:03:10:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 31.8175, sec/sent: 0.0106, sent/sec: 94.3819
mx20/out.1.bpe.log:[2021-10-04:20:17:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 714.5509, sec/sent: 0.2379, sent/sec: 4.2026
mx20/out.64.bpe.log:[2021-10-04:20:18:26:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 46.4607, sec/sent: 0.0155, sent/sec: 64.6352
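The g4 results are grep-style lines prefixed with the log file name. The sketch below parses them and computes the 2.x/1.x time ratio per batch size. It assumes, based only on the file names, that `mx18` is the MXNet 1.8 run, `mx20` is the MXNet 2.x run, and `out.<N>` encodes batch size N; the regex and helper are hypothetical and tied to the exact log format shown above.

```python
import re
from collections import defaultdict

# g4 log lines as shown above ("<log file>:<log line>").
LOG_LINES = [
    "mx18/out.1.bpe.log:[2021-10-04:20:02:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 316.4692, sec/sent: 0.1054, sent/sec: 9.4891",
    "mx18/out.64.bpe.log:[2021-10-04:20:03:10:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 31.8175, sec/sent: 0.0106, sent/sec: 94.3819",
    "mx20/out.1.bpe.log:[2021-10-04:20:17:32:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 714.5509, sec/sent: 0.2379, sent/sec: 4.2026",
    "mx20/out.64.bpe.log:[2021-10-04:20:18:26:INFO:__main__:read_and_translate] Processed 3003 lines. Total time: 46.4607, sec/sent: 0.0155, sent/sec: 64.6352",
]

# Assumption: directory = MXNet build (mx18, mx20), out.<N> = batch size N.
PATTERN = re.compile(
    r"^(?P<build>mx\d+)/out\.(?P<batch>\d+)\.bpe\.log:.*Total time: (?P<total>[\d.]+),"
)


def parse(lines):
    """Map batch size -> {build name: total time in seconds}."""
    results = defaultdict(dict)
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # skip anything that is not a translation summary line
        results[int(match["batch"])][match["build"]] = float(match["total"])
    return results


results = parse(LOG_LINES)
for batch, times in sorted(results.items()):
    print(f"batch {batch}: mx20/mx18 time ratio {times['mx20'] / times['mx18']:.2f}x")
```

Under those assumptions, the g4 slowdown is about 2.26x at batch size 1 and 1.46x at batch size 64, smaller than on the p3.2xlarge but in the same direction.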
To Reproduce
- Download the Sockeye sample model
- Run `translate.sh` with the `master` branch of Sockeye
- Run `translate.sh` with the `mx2` branch of Sockeye
Steps to reproduce
- wget https://github.com/awslabs/sockeye/releases/download/2.3.22/wmt14_en_de.tgz
- tar -xvf wmt14_en_de.tgz
- git clone https://github.com/awslabs/sockeye.git
- pip install -r sockeye/requirements/requirements.gpu-cu112.txt
- mv sockeye/sockeye wmt_14_en_de
- cd wmt_14_en_de
- bash translate.sh [translate with master branch]
- git checkout mx2
- (Install nightly build of mx2: pip uninstall mxnet-cu112 ; pip install --pre -f https://dist.mxnet.io/python 'mxnet-cu112')
- bash translate.sh [translate with mx2 branch]
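Because the 1.x and 2.x builds are installed under the same distribution name, `mxnet-cu112`, it is easy to benchmark the wrong build after swapping packages. A minimal standard-library sketch (Python 3.8+) for confirming which build is active before each `translate.sh` run; the helper names are hypothetical:

```python
from importlib import metadata
from typing import Optional


def installed_mxnet_version(dist_name: str = "mxnet-cu112") -> Optional[str]:
    """Return the installed version string of the distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_mxnet2(version: Optional[str]) -> bool:
    """True for 2.x version strings such as '2.0.0b20211001'."""
    return version is not None and version.split(".", 1)[0] == "2"


version = installed_mxnet_version()
print(f"mxnet-cu112 version: {version}; 2.x build: {is_mxnet2(version)}")
```

Reading the package metadata avoids importing MXNet itself, so the check is fast and does not initialize CUDA.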
Environment
- CUDA 11.2 (conda install -c conda-forge nccl cudnn cudatoolkit==11.2)
- MXNet 1.8.post0 or MXNet 1.7 vs. MXNet 2.x (2.0.0b20211001)