CPU inference is very slow for some model checkpoints #19218
Description
I trained a ResNet-50 model. When running inference with an early checkpoint (0003.params) on CPU, it runs at a reasonable speed. However, inference with a later checkpoint (0023.params) on CPU is much slower. Both checkpoints run fast, at similar speeds, on GPU. I ran the profiler, and the bottleneck is in the convolution layers, where some Conv ops take up to 110 ms.
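One possible cause of this symptom (an assumption on my part, not confirmed in this issue) is that the later checkpoint's weights have drifted into the float32 subnormal (denormal) range; many CPUs handle subnormals through slow microcode paths, while GPUs process them at full speed. A minimal sketch to check a checkpoint for subnormal values, assuming the parameters have already been converted to a dict of NumPy arrays (e.g. by calling .asnumpy() on each NDArray in arg_params; count_subnormals is a hypothetical helper):

```python
import numpy as np

def count_subnormals(params):
    """Count float32 values in the subnormal range (0 < |x| < FLT_MIN).

    `params` is assumed to be a dict mapping parameter names to arrays.
    """
    tiny = np.finfo(np.float32).tiny  # smallest positive *normal* float32
    total = 0
    subnormal = 0
    for arr in params.values():
        a = np.asarray(arr, dtype=np.float32)
        total += a.size
        subnormal += int(np.count_nonzero((np.abs(a) > 0) & (np.abs(a) < tiny)))
    return subnormal, total
```

If the 0023.params checkpoint reports a much higher subnormal count than 0003.params, that would point at denormal arithmetic rather than at MKL-DNN itself.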
Here are the profiling results:
fast model 0003.params on CPU:
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 672 0.0000 214184.3125 107092.1562
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 5506.6421 5506.6421 5506.6421 5506.6421
MXExecutorForward 100 13.2700 0.0760 0.3040 0.1327
MXNet C API Calls 2301 2.3010 0.0010 2.3010 1.1500
MXImperativeInvokeEx 100 1.5130 0.0090 0.0560 0.0151
MXNDArrayGetShapeEx 1200 0.3880 0.0000 0.0090 0.0003
MXNDArrayReshape64 100 0.1490 0.0010 0.0040 0.0015
MXExecutorOutputs 100 0.0730 0.0000 0.0020 0.0007
MXNDArrayGetContext 200 0.0390 0.0000 0.0010 0.0002
MXNDArrayFree 200 0.0340 0.0000 0.0010 0.0002
MXNDArrayGetDType 200 0.0250 0.0000 0.0010 0.0001
MXNDArrayGetStorageType 100 0.0150 0.0000 0.0010 0.0002
MXNet C API Concurrency 4602 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 5300 3687.2700 0.0430 3.5340 0.6957
LeakyReLU 2500 722.8160 0.1400 2.5450 0.2891
_sg_mkldnn_fully_connected 100 194.0460 1.8040 2.4230 1.9405
BatchNorm 2600 102.9240 0.0120 0.2850 0.0396
Dropout 100 14.9120 0.1140 0.2520 0.1491
DeleteVariable 559 6.8570 0.0010 1.3360 0.0123
CopyCPU2CPU 100 1.5180 0.0130 0.0240 0.0152
_minus_scalar 100 1.3820 0.0110 0.0230 0.0138
_mul_scalar 100 0.9680 0.0070 0.0260 0.0097
slow model 0023.params on CPU:
Profile Statistics:
Note the difference in units for different entries.
Device Storage
=================
Name Total Count Min Use (kB) Max Use (kB) Avg Use (kB)
---- ----------- ------------- ------------- -------------
Memory: cpu/0 672 0.0000 214184.3125 107092.1562
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXNDArrayWaitAll 1 78402.4453 78402.4453 78402.4453 78402.4453
MXExecutorForward 100 15.5860 0.0890 0.2150 0.1559
MXNet C API Calls 2301 2.3010 0.0010 2.3010 1.1500
MXImperativeInvokeEx 100 1.5790 0.0090 0.0270 0.0158
MXNDArrayGetShapeEx 1200 0.5500 0.0000 0.0100 0.0005
MXNDArrayReshape64 100 0.1820 0.0010 0.0090 0.0018
MXExecutorOutputs 100 0.0740 0.0000 0.0030 0.0007
MXNDArrayGetContext 200 0.0540 0.0000 0.0010 0.0003
MXNDArrayGetDType 200 0.0390 0.0000 0.0010 0.0002
MXNDArrayFree 200 0.0360 0.0000 0.0010 0.0002
MXNDArrayGetStorageType 100 0.0260 0.0000 0.0010 0.0003
MXNet C API Concurrency 4602 0.0000 0.0000 0.0010 0.0000
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
_sg_mkldnn_conv 5300 68954.4688 0.0450 109.0230 13.0103
LeakyReLU 2500 1831.9830 0.1510 4.5000 0.7328
_sg_mkldnn_fully_connected 100 186.8240 1.8090 2.2790 1.8682
BatchNorm 2600 115.5410 0.0120 0.8330 0.0444
Dropout 100 15.9720 0.1430 0.2250 0.1597
DeleteVariable 559 6.3650 0.0010 1.4840 0.0114
_minus_scalar 100 1.4650 0.0120 0.0220 0.0147
CopyCPU2CPU 100 1.4390 0.0130 0.0190 0.0144
_mul_scalar 100 0.8480 0.0070 0.0150 0.0085
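Comparing the two runs, the regression is concentrated in _sg_mkldnn_conv. The per-operator slowdown works out as follows (average per-call times in ms copied from the two operator tables above):

```python
# Avg Time (ms) per call, taken from the fast (0003.params) and
# slow (0023.params) operator tables above.
fast_avg = {'_sg_mkldnn_conv': 0.6957, 'LeakyReLU': 0.2891, 'BatchNorm': 0.0396}
slow_avg = {'_sg_mkldnn_conv': 13.0103, 'LeakyReLU': 0.7328, 'BatchNorm': 0.0444}

for op in fast_avg:
    ratio = slow_avg[op] / fast_avg[op]
    print(f'{op}: {ratio:.1f}x slower')
```

The convolution operator is roughly 18.7x slower with the later checkpoint, while BatchNorm is essentially unchanged, so the problem is specific to the conv kernels rather than a global slowdown.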
Error Message
No error
To Reproduce
import mxnet as mx
from mxnet import profiler

sym, arg_params, aux_params = mx.model.load_checkpoint('model', 23)
sample = mx.ndarray.zeros((1, 3, 112, 112))
it = 100
ctx = mx.cpu()
executor = sym.simple_bind(ctx=ctx, data=sample.shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

def run_inference():
    out = executor.forward(is_train=False, data=sample)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(10):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')
for i in range(it):
    run_inference()
# Make sure all operations have completed
mx.nd.waitall()
# Ask the profiler to stop recording
profiler.set_state('stop')

# Dump all results to log file
print(profiler.dumps())
Steps to reproduce
- Run the above code with the attached 0003.params and 0023.params
- Compare the profiler outputs
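Independent of the profiler, a quick wall-clock comparison can confirm the gap. A minimal timing sketch (avg_seconds is a hypothetical helper; it assumes run_inference from the code above, and since MXNet's engine is asynchronous, the timed function should end with mx.nd.waitall() so the forward pass actually completes before the clock stops):

```python
import time

def avg_seconds(fn, iters=100, warmup=10):
    """Wall-clock average seconds per call of fn, after a warm-up phase."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```

Binding each checkpoint in turn and calling avg_seconds(run_inference) should reproduce the roughly 5.5 s vs 78 s totals seen in MXNDArrayWaitAll above.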
What have you tried to solve it?
Environment
MXNet 1.4 cu100mkl, MXNet 1.6 cu100mkl, and MXNet 1.7 native all reproduce the issue.
0003.params, 0023.params, and the model .json are available in the following Google Drive folder:
https://drive.google.com/drive/folders/1NE9wxNXmzPX35ImYBMLvbCpqLX8zq8Xq?usp=sharing