Low CPU usage of MXNet in subprocesses #13593
Description
MXNet shows low CPU usage when running CPU operations in multi-process scenarios. Specifically, when MXNet computation runs in a subprocess, MXNet uses only 1 or 2 CPUs. The behavior differs across MXNet variants (see below) and across machines ...
This issue is critical because it significantly slows down the multi-process object-detection data loading in gluoncv, making Faster-RCNN training in gluoncv unusable.
This was tested on the 20181207 version; other versions (e.g., 1.3.1) show similar problems.
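For reference, a "CPUs used per process" figure like the ones quoted in this report can be approximated from inside a process by dividing consumed CPU time by elapsed wall time. The sketch below uses only the standard library; `cpus_used` and `busy_loop` are hypothetical helper names, and this is not necessarily how the numbers in this report were obtained (a tool such as `top` gives a comparable reading).

```python
import time


def cpus_used(work, *args):
    """Run work(*args) and return the average number of CPUs this process kept busy."""
    wall_start = time.perf_counter()
    # process_time() is user + system CPU time of the whole process (all threads).
    cpu_start = time.process_time()
    work(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy_loop(n):
    # Stand-in single-threaded workload; replace with the MXNet computation.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    print("approx. CPUs used: %.2f" % cpus_used(busy_loop, 2_000_000))
```

A single-threaded workload should report a value close to 1; a multi-threaded native library that keeps several cores busy would report a correspondingly higher value.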
Code to reproduce the issue
Filename: mxnet_cpu_test.py
import argparse
import sys
from concurrent import futures
import time
import numpy as np

mx = None  # bound by the top-level import in __main__ or by the late import in run()


def run(need_import):
    # Declare the global first: binding mx before the declaration
    # is a SyntaxError in Python 3.6+.
    global mx
    if need_import:
        import mxnet as mx
    # Repeated large matrix products keep the CPU busy until interrupted.
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)


def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()


def main(args):
    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)
        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        # Keep the parent alive until Ctrl-C, then stop without waiting for workers.
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
        # Early import: mxnet is loaded in the parent before the workers start.
        import mxnet as mx
    main(args)

Detailed experiments:
- Run in the main process:
python3 mxnet_cpu_test.py --num-workers=0

Working fine for all mxnet variants (GPU or CPU-only).
- Run in two subprocesses:
-- mxnet-cu90 on p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

It uses only 2 CPUs per subprocess.
-- mxnet-mkl on p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

Same here. It uses only 2 CPUs per subprocess.
-- mxnet-mkl on the CPU-only machine c5.18x:
python3 mxnet_cpu_test.py --num-workers=2

Even worse. It uses only 1.5 CPUs per subprocess.
-- However, for the vanilla CPU version of mxnet on c5.18x:
python3 mxnet_cpu_test.py --num-workers=2

It is working better. At least, it uses 5 CPUs per subprocess.
-- Weirdly, still the vanilla CPU version of mxnet, but on the GPU machine p3.16x:
python3 mxnet_cpu_test.py --num-workers=2

It is working worse, i.e., 2 CPUs per subprocess.
- This problem seems related to how MXNet manages threads per subprocess. If I do not import mxnet in the main process and instead import mxnet in each subprocess:
python3 mxnet_cpu_test.py --num-workers=2 --late-import

Then everything is working fine.
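
Since the behavior depends on whether mxnet was loaded before the workers were created, one way to investigate further is to compare thread-related process state in the parent and in a worker. The following is only a diagnostic sketch using the standard library: `thread_report` is a hypothetical helper, and the assumption that CPU affinity or the standard `OMP_NUM_THREADS` / `MKL_NUM_THREADS` variables are involved is unconfirmed. To test against MXNet, add `import mxnet` in the parent before the pool is created.

```python
import os
from concurrent import futures

# Standard OpenMP / MKL thread-count variables; whether they matter here is an assumption.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")


def thread_report():
    """Collect CPU-related state for the current process."""
    report = {"pid": os.getpid(), "cpu_count": os.cpu_count()}
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        report["affinity"] = len(os.sched_getaffinity(0))
    for name in THREAD_VARS:
        # os.environ reflects values present at interpreter start or set from Python;
        # changes made by native code via setenv() will not show up here.
        report[name] = os.environ.get(name)
    return report


if __name__ == "__main__":
    print("parent:", thread_report())
    with futures.ProcessPoolExecutor(1) as ex:
        print("child: ", ex.submit(thread_report).result())
```

If the child reports a smaller affinity set than the parent, that would point toward CPU pinning being inherited by the workers; identical reports would suggest the limit is applied inside the native library instead.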