CPU inference is very slow for some model checkpoints #19218
Description
I trained a ResNet-50 model. When running inference with an early checkpoint (0003.params) on CPU, it runs at a reasonable speed. However, inference with a later checkpoint (0023.params) on CPU is much slower. Both checkpoints run fast, at similar speeds, on GPU. I ran the profiler, and the bottleneck is in the convolution layers, where some Conv ops take up to 110 ms.
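One possible cause of this symptom (an assumption on my part, not confirmed in this issue) is that the later checkpoint's weights have drifted into the float32 subnormal (denormal) range; many CPUs handle subnormals through slow microcode paths, while GPUs process them at full speed. A minimal sketch to check a checkpoint for subnormal values, assuming the parameters have already been converted to a dict of NumPy arrays (e.g. by calling .asnumpy() on each NDArray in arg_params; count_subnormals is a hypothetical helper):

```python
import numpy as np

def count_subnormals(params):
    """Count float32 values in the subnormal range (0 < |x| < FLT_MIN).

    `params` is assumed to be a dict mapping parameter names to arrays.
    """
    tiny = np.finfo(np.float32).tiny  # smallest positive *normal* float32
    total = 0
    subnormal = 0
    for arr in params.values():
        a = np.asarray(arr, dtype=np.float32)
        total += a.size
        subnormal += int(np.count_nonzero((np.abs(a) > 0) & (np.abs(a) < tiny)))
    return subnormal, total
```

If the 0023.params checkpoint reports a much higher subnormal count than 0003.params, that would point at denormal arithmetic rather than at MKL-DNN itself.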
Here are the profiling results:
fast model 0003.params on CPU:
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 672 0.0000 214184.3125 107092.1562
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 5506.6421 5506.6421 5506.6421 5506.6421
MXExecutorForward 100 13.2700 0.0760 0.3040 0.1327
MXNet C API Calls 2301 2.3010 0.0010 2.3010 1.1500
MXImperativeInvokeEx 100 1.5130 0.0090 0.0560 0.0151
MXNDArrayGetShapeEx 1200 0.3880 0.0000 0.0090 0.0003
MXNDArrayReshape64 100 0.1490 0.0010 0.0040 0.0015
MXExecutorOutputs 100 0.0730 0.0000 0.0020 0.0007
MXNDArrayGetContext 200 0.0390 0.0000 0.0010 0.0002
MXNDArrayFree 200 0.0340 0.0000 0.0010 0.0002
MXNDArrayGetDType 200 0.0250 0.0000 0.0010 0.0001
MXNDArrayGetStorageType 100 0.0150 0.0000 0.0010 0.0002
MXNet C API Concurrency 4602 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 5300 3687.2700 0.0430 3.5340 0.6957
LeakyReLU 2500 722.8160 0.1400 2.5450 0.2891
_sg_mkldnn_fully_connected 100 194.0460 1.8040 2.4230 1.9405
BatchNorm 2600 102.9240 0.0120 0.2850 0.0396
Dropout 100 14.9120 0.1140 0.2520 0.1491
DeleteVariable 559 6.8570 0.0010 1.3360 0.0123
CopyCPU2CPU 100 1.5180 0.0130 0.0240 0.0152
_minus_scalar 100 1.3820 0.0110 0.0230 0.0138
_mul_scalar 100 0.9680 0.0070 0.0260 0.0097
slow model 0023.params on CPU:
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 672 0.0000 214184.3125 107092.1562
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 78402.4453 78402.4453 78402.4453 78402.4453
MXExecutorForward 100 15.5860 0.0890 0.2150 0.1559
MXNet C API Calls 2301 2.3010 0.0010 2.3010 1.1500
MXImperativeInvokeEx 100 1.5790 0.0090 0.0270 0.0158
MXNDArrayGetShapeEx 1200 0.5500 0.0000 0.0100 0.0005
MXNDArrayReshape64 100 0.1820 0.0010 0.0090 0.0018
MXExecutorOutputs 100 0.0740 0.0000 0.0030 0.0007
MXNDArrayGetContext 200 0.0540 0.0000 0.0010 0.0003
MXNDArrayGetDType 200 0.0390 0.0000 0.0010 0.0002
MXNDArrayFree 200 0.0360 0.0000 0.0010 0.0002
MXNDArrayGetStorageType 100 0.0260 0.0000 0.0010 0.0003
MXNet C API Concurrency 4602 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 5300 68954.4688 0.0450 109.0230 13.0103
LeakyReLU 2500 1831.9830 0.1510 4.5000 0.7328
_sg_mkldnn_fully_connected 100 186.8240 1.8090 2.2790 1.8682
BatchNorm 2600 115.5410 0.0120 0.8330 0.0444
Dropout 100 15.9720 0.1430 0.2250 0.1597
DeleteVariable 559 6.3650 0.0010 1.4840 0.0114
_minus_scalar 100 1.4650 0.0120 0.0220 0.0147
CopyCPU2CPU 100 1.4390 0.0130 0.0190 0.0144
_mul_scalar 100 0.8480 0.0070 0.0150 0.0085
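Comparing the two runs, the regression is concentrated in _sg_mkldnn_conv. The per-operator slowdown works out as follows (average per-call times in ms copied from the two operator tables above):

```python
# Avg Time (ms) per call, taken from the fast (0003.params) and
# slow (0023.params) operator tables above.
fast_avg = {'_sg_mkldnn_conv': 0.6957, 'LeakyReLU': 0.2891, 'BatchNorm': 0.0396}
slow_avg = {'_sg_mkldnn_conv': 13.0103, 'LeakyReLU': 0.7328, 'BatchNorm': 0.0444}

for op in fast_avg:
    ratio = slow_avg[op] / fast_avg[op]
    print(f'{op}: {ratio:.1f}x slower')
```

The convolution operator is roughly 18.7x slower with the later checkpoint, while BatchNorm is essentially unchanged, so the problem is specific to the conv kernels rather than a global slowdown.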
Error Message
No error
To Reproduce
import mxnet as mx
from mxnet import profiler

sym, arg_params, aux_params = mx.model.load_checkpoint('model', 23)
sample = mx.ndarray.zeros((1, 3, 112, 112))
it = 100
ctx = mx.cpu()
executor = sym.simple_bind(ctx=ctx, data=sample.shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

def run_inference():
    out = executor.forward(is_train=False, data=sample)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(10):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')
for i in range(it):
    run_inference()
# Make sure all operations have completed
mx.nd.waitall()
# Ask the profiler to stop recording
profiler.set_state('stop')

# Dump all results to log file
print(profiler.dumps())
Steps to reproduce
- Run the above code with the attached 0003.params and 0023.params
- Compare the profiler outputs
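Independent of the profiler, a quick wall-clock comparison can confirm the gap. A minimal timing sketch (avg_seconds is a hypothetical helper; it assumes run_inference from the code above, and since MXNet's engine is asynchronous, the timed function should end with mx.nd.waitall() so the forward pass actually completes before the clock stops):

```python
import time

def avg_seconds(fn, iters=100, warmup=10):
    """Wall-clock average seconds per call of fn, after a warm-up phase."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```

Binding each checkpoint in turn and calling avg_seconds(run_inference) should reproduce the roughly 5.5 s vs 78 s totals seen in MXNDArrayWaitAll above.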
What have you tried to solve it?
Environment
MXNet 1.4 cu100mkl, MXNet 1.6 cu100mkl, and MXNet 1.7 native all reproduce the issue.
0003.params, 0023.params, and the model .json are available in the following Google Drive folder:
https://drive.google.com/drive/folders/1NE9wxNXmzPX35ImYBMLvbCpqLX8zq8Xq?usp=sharing