DLRM performance regression on #26963 #28198

@JianpingChen066

Description

🐛 DLRM performance regression

This check-in (#26963) causes a small regression in the DLRM benchmark.

Without this check-in, the DLRM benchmark result is:

numactl --physcpubind=0-27 -m 0 python dlrm_s_pytorch.py --mini-batch-size=2048 --num-batches=100 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --num-indices-per-lookup=100 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=100 --print-time --enable-profiling > model1_CPU_PT_28.log
Min time per iteration = 3591.85

With this check-in, the result is:

numactl --physcpubind=0-27 -m 0 python dlrm_s_pytorch.py --mini-batch-size=2048 --num-batches=100 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --num-indices-per-lookup=100 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=100 --print-time --enable-profiling > model1_CPU_PT_28.log
Min time per iteration = 3649.57

The profiling data show that time spent in the index_select, mm, and addmm operations has increased:

Operation      This (ms)    Before (ms)
index_select   1559.31      1503.48
mm             27.14        14.58
addmm          22.38        9.79966
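
For a quick side-by-side, the headline number and the three op rows can be pulled straight out of the redirected output of the two runs above. A minimal sketch, assuming the profiler summary produced by --enable-profiling lands in the redirected log and that one of the two runs is renamed first (both commands above write to model1_CPU_PT_28.log):

    # Placeholder names for the renamed 'before'/'after' logs.
    for log in model1_CPU_PT_28_before.log model1_CPU_PT_28_after.log; do
        echo "== $log =="
        grep -Ew 'Min time per iteration|index_select|mm|addmm' "$log"
    done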

To Reproduce

Steps to reproduce the behavior:

  1. Download the DLRM from https://github.com/facebookresearch/dlrm
  2. Modify bench/dlrm_s_benchmark.sh so that only the PyTorch CPU version runs:
    build=0
    cpu=1
    gpu=0
    pt=1
    c2=0
    and export two KMP variables:
    export KMP_BLOCKTIME=1
    export KMP_AFFINITY="granularity=fine,compact,1,0"
  3. Run bench/dlrm_s_benchmark.sh on an SKX 8180 machine; the performance profiling data is stored in the file model1_CPU_PT_28.prof (a consolidated script for these steps is sketched after this list).
    'This' was measured at commit d0a4b2f.
    'Before' was measured at commit 42e7eb0.
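
A consolidated sketch of the three steps above, for convenience. The sed edits assume the build/cpu/gpu/pt/c2 switches are plain variable assignments in bench/dlrm_s_benchmark.sh (editing the file by hand is equivalent), and /path/to/pytorch is a placeholder for the local PyTorch checkout that is rebuilt at each of the two commits being compared:

    git clone https://github.com/facebookresearch/dlrm && cd dlrm
    sed -i 's/^build=.*/build=0/; s/^cpu=.*/cpu=1/; s/^gpu=.*/gpu=0/; s/^pt=.*/pt=1/; s/^c2=.*/c2=0/' \
        bench/dlrm_s_benchmark.sh
    export KMP_BLOCKTIME=1
    export KMP_AFFINITY="granularity=fine,compact,1,0"

    # 'Before' = 42e7eb0, 'This' = d0a4b2f: rebuild PyTorch at each commit
    # (a clean rebuild may be needed), then rerun the benchmark.
    for rev in 42e7eb0 d0a4b2f; do
        (cd /path/to/pytorch && git checkout "$rev" && \
         git submodule update --init --recursive && python setup.py install)
        bash bench/dlrm_s_benchmark.sh
    done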

Expected behavior

DLRM performance should not be impacted. Thanks.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): commit-id: d0a4b2f

  • OS (e.g., Linux): Ubuntu 16.04.5 LTS

  • How you installed PyTorch (conda, pip, source):

  • Build command you used (if compiling from source): python setup.py install

  • Python version: 3.7

  • CUDA/cuDNN version: N/A

  • GPU models and configuration: N/A

  • Any other relevant information:
    GCC version: (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
    CMake version: version 3.14.4

    [pip3] numpy==1.16.2
    [pip3] numpydoc==0.8.0
    [conda] blas 1.0 mkl
    [conda] mkl 2019.0 pypi_0 pypi
    [conda] mkl-devel 2019.3 200
    [conda] mkl-include 2019.0 pypi_0 pypi
    [conda] mkl-service 1.1.2 py37he904b0f_5
    [conda] mkl_fft 1.0.10 py37ha843d7b_0
    [conda] mkl_random 1.0.2 py37hd81dba3_0

Additional context

cc @ezyang @gchanan @zou3519 @jerryzh168

Labels: high priority, triaged
