DLRM performance regression on #26963 #28198

@JianpingChen066

Description

🐛 DLRM performance regression

This check-in (#26963) causes a small regression in the DLRM benchmark.

Without this check-in, the DLRM benchmark result is:

numactl --physcpubind=0-27 -m 0 python dlrm_s_pytorch.py --mini-batch-size=2048 --num-batches=100 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --num-indices-per-lookup=100 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=100 --print-time --enable-profiling > model1_CPU_PT_28.log
Min time per iteration = 3591.85

With this check-in, the result is:

numactl --physcpubind=0-27 -m 0 python dlrm_s_pytorch.py --mini-batch-size=2048 --num-batches=100 --data-generation=random --arch-mlp-bot=512-512-64 --arch-mlp-top=1024-1024-1024-1 --arch-sparse-feature-size=64 --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 --num-indices-per-lookup=100 --arch-interaction-op=dot --numpy-rand-seed=727 --print-freq=100 --print-time --enable-profiling > model1_CPU_PT_28.log
Min time per iteration = 3649.57

The profiling data show that time spent in the index_select, mm, and addmm operations has increased:

Operation      This (ms)    Before (ms)
index_select   1559.31      1503.48
mm             27.14        14.58
addmm          22.38        9.79966
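
For a quick side-by-side, the headline number and the three op rows can be pulled straight out of the redirected output of the two runs above. A minimal sketch, assuming the profiler summary produced by --enable-profiling lands in the redirected log and that one of the two runs is renamed first (both commands above write to model1_CPU_PT_28.log):

    # Placeholder names for the renamed 'before'/'after' logs.
    for log in model1_CPU_PT_28_before.log model1_CPU_PT_28_after.log; do
        echo "== $log =="
        grep -Ew 'Min time per iteration|index_select|mm|addmm' "$log"
    done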

To Reproduce

Steps to reproduce the behavior:

  1. Download the DLRM from https://github.com/facebookresearch/dlrm
  2. Modify bench/dlrm_s_benchmark.sh so that only the PyTorch CPU version runs:
    build=0
    cpu=1
    gpu=0
    pt=1
    c2=0
    and export two KMP variables:
    export KMP_BLOCKTIME=1
    export KMP_AFFINITY="granularity=fine,compact,1,0"
  3. Run bench/dlrm_s_benchmark.sh on an SKX 8180 machine; the performance profiling data is stored in the file model1_CPU_PT_28.prof (a consolidated script for these steps is sketched after this list).
    'This' was measured at commit d0a4b2f.
    'Before' was measured at commit 42e7eb0.
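
A consolidated sketch of the three steps above, for convenience. The sed edits assume the build/cpu/gpu/pt/c2 switches are plain variable assignments in bench/dlrm_s_benchmark.sh (editing the file by hand is equivalent), and /path/to/pytorch is a placeholder for the local PyTorch checkout that is rebuilt at each of the two commits being compared:

    git clone https://github.com/facebookresearch/dlrm && cd dlrm
    sed -i 's/^build=.*/build=0/; s/^cpu=.*/cpu=1/; s/^gpu=.*/gpu=0/; s/^pt=.*/pt=1/; s/^c2=.*/c2=0/' \
        bench/dlrm_s_benchmark.sh
    export KMP_BLOCKTIME=1
    export KMP_AFFINITY="granularity=fine,compact,1,0"

    # 'Before' = 42e7eb0, 'This' = d0a4b2f: rebuild PyTorch at each commit
    # (a clean rebuild may be needed), then rerun the benchmark.
    for rev in 42e7eb0 d0a4b2f; do
        (cd /path/to/pytorch && git checkout "$rev" && \
         git submodule update --init --recursive && python setup.py install)
        bash bench/dlrm_s_benchmark.sh
    done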

Expected behavior

DLRM performance should not be impacted. Thanks.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): commit-id: d0a4b2f

  • OS (e.g., Linux): Ubuntu 16.04.5 LTS

  • How you installed PyTorch (conda, pip, source):

  • Build command you used (if compiling from source): python setup.py install

  • Python version: 3.7

  • CUDA/cuDNN version: N/A

  • GPU models and configuration: N/A

  • Any other relevant information:
    GCC version: (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
    CMake version: version 3.14.4

    [pip3] numpy==1.16.2
    [pip3] numpydoc==0.8.0
    [conda] blas 1.0 mkl
    [conda] mkl 2019.0 pypi_0 pypi
    [conda] mkl-devel 2019.3 200
    [conda] mkl-include 2019.0 pypi_0 pypi
    [conda] mkl-service 1.1.2 py37he904b0f_5
    [conda] mkl_fft 1.0.10 py37ha843d7b_0
    [conda] mkl_random 1.0.2 py37hd81dba3_0

Additional context

cc @ezyang @gchanan @zou3519 @jerryzh168

Labels: high priority, triaged
