Conversation

@salexspb
Contributor

@salexspb salexspb commented May 1, 2019

Summary:
This is useful when you would like to understand the performance
bottlenecks of your model. One can use the shape analysis to fit the
model to a roofline model of the hardware.

Please note that this feature can potentially skew profiling results;
in particular, timing for non-nested (parent) events will become skewed,
so one should only rely on the timing of the bottom-most events when
shape analysis is enabled. When shapes are not needed, profiling is
unaffected: shapes are not collected by default, and this diff does not
change that.

One of the next steps could be, for example, choosing the best
candidates for quantization. In the scope of this diff I am just adding
optional shape collection to the Event class, plus minor functionality
on the Python side for grouping by shapes.

In the output tables shapes are truncated, but for grouping the full
shape string is used as the key.
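For reference, a minimal usage sketch. The flag names below (`record_shapes` on the profiler and `group_by_input_shape` on `key_averages`) are assumptions about how the feature is exposed on the Python side, not text quoted from this diff:

```python
import torch
from torch.autograd import profiler

model = torch.nn.Sequential(torch.nn.Linear(20, 30), torch.nn.Linear(30, 40))
x = torch.randn(128, 20)

# Shape collection is opt-in; the default (no shapes) leaves profiling unaffected.
with profiler.profile(record_shapes=True) as prof:
    model(x)

# Group events that share both the operator name and the full input-shape string.
print(prof.key_averages(group_by_input_shape=True).table())
```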

Here is an example output:

test_profiler_shapes (test_autograd.TestAutograd) ...

------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
unsigned short      2.30%            305.031us        2.30%            305.031us        305.031us        NaN              0.000us          0.000us          1                [[30, 20]]
addmm               69.40%           9.199ms          69.40%           9.199ms          9.199ms          NaN              0.000us          0.000us          1                [[30], [128, 20], [20, 30], [], []]
unsigned short      0.98%            129.326us        0.98%            129.326us        129.326us        NaN              0.000us          0.000us          1                [[40, 30]]
addmm               27.32%           3.621ms          27.32%           3.621ms          3.621ms          NaN              0.000us          0.000us          1                [[40], [128, 30], [30, 40], [], []]
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us

------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
unsigned short      2.30%            305.031us        2.30%            305.031us        305.031us        NaN              0.000us          0.000us          1                [[30, 20]]
addmm               69.40%           9.199ms          69.40%           9.199ms          9.199ms          NaN              0.000us          0.000us          1                [[30], [128, 20], [20, 30], [], []]
unsigned short      0.98%            129.326us        0.98%            129.326us        129.326us        NaN              0.000us          0.000us          1                [[40, 30]]
addmm               27.32%           3.621ms          27.32%           3.621ms          3.621ms          NaN              0.000us          0.000us          1                [[40], [128, 30], [30, 40], [], []]
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us

Also added this for older aggregation test:

test_profiler_aggregation_lstm (test_autograd.TestAutograd) ...
======================================================================================================================================================================================================
TEST
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
lstm                     0.69%            4.606ms          5.30%            35.507ms         35.507ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.67%            4.521ms          5.27%            35.340ms         35.340ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.66%            4.399ms          5.02%            33.638ms         33.638ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.65%            4.354ms          4.92%            32.958ms         32.958ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.65%            4.351ms          4.96%            33.241ms         33.241ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.65%            4.323ms          5.10%            34.163ms         34.163ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.304ms          4.92%            32.938ms         32.938ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.300ms          5.10%            34.172ms         34.172ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.292ms          5.05%            33.828ms         33.828ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.263ms          4.98%            33.357ms         33.357ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
sigmoid                  15.32%           102.647ms        15.32%           102.647ms        171.078us        NaN              0.000us          0.000us          600              [[3, 20]]
mul                      15.20%           101.854ms        15.20%           101.854ms        169.757us        NaN              0.000us          0.000us          600              [[3, 20], [3, 20]]
lstm                     12.74%           85.355ms         100.00%          670.120ms        33.506ms         NaN              0.000us          0.000us          20               [[5, 3, 10]]
addmm                    11.16%           74.808ms         11.16%           74.808ms         249.361us        NaN              0.000us          0.000us          300              [[80], [3, 20], [20, 80], [], []]
tanh                     9.89%            66.247ms         9.89%            66.247ms         165.617us        NaN              0.000us          0.000us          400              [[3, 20]]
split                    6.42%            43.019ms         6.42%            43.019ms         215.095us        NaN              0.000us          0.000us          200              [[3, 80]]
add                      5.67%            38.020ms         5.67%            38.020ms         190.101us        NaN              0.000us          0.000us          200              [[3, 80], [3, 80], []]
add                      4.81%            32.225ms         4.81%            32.225ms         161.124us        NaN              0.000us          0.000us          200              [[3, 20], [3, 20], []]
addmm                    3.79%            25.380ms         3.79%            25.380ms         253.796us        NaN              0.000us          0.000us          100              [[80], [3, 10], [10, 80], [], []]
unsigned short           3.72%            24.925ms         3.72%            24.925ms         83.083us         NaN              0.000us          0.000us          300              [[80, 20]]
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us

Total time based on python measurements:  691.366ms
CPU time measurement python side overhead: 3.17%
ok

Differential Revision: D15174987

@pytorchbot pytorchbot added the module: autograd Related to torch.autograd, and the autograd engine in general label May 1, 2019
@salexspb salexspb force-pushed the export-D15174987 branch from 78c22bf to 3507f78 Compare May 1, 2019 22:57
@ilia-cher
Contributor

@cpuhrsch Christian, I think this will be helpful for that use case of debugging shapes you mentioned.

@salexspb salexspb force-pushed the export-D15174987 branch from 3507f78 to 733b702 Compare May 2, 2019 17:22
@salexspb salexspb requested review from ilia-cher and zdevito May 2, 2019 17:24
@cpuhrsch
Contributor

cpuhrsch commented May 2, 2019

Awesome!

@stephenroller
Contributor

Just checking: does this make the fp16/fp32 distinction? It might be useful for making sure you're not accidentally slipping into fp32 ops at some point.

@stephenroller
Contributor

@salexspb if it's easy to support an fp16/fp32 distinction it could be really useful for identifying:

  1. If I'm converting a model to fp16, whether I successfully found every place where I forgot to use .as_type(dtype) instead of .float(), or where I did some ops in fp32 temporarily but forgot to convert them back.
  2. Whether the ops have the same profile ratios in fp16/fp32 (maybe an op is particularly inefficient in fp16 and I should avoid it).

Essentially, grouping the shapes by dtype too could be very beneficial.

@salexspb
Contributor Author

salexspb commented May 2, 2019

@salexspb if it's easy to support an fp16/fp32 distinction it could be really useful for identifying:

  1. If I'm converting a model to fp16, whether I successfully found every place where I forgot to use .as_type(dtype) instead of .float(), or where I did some ops in fp32 temporarily but forgot to convert them back.
  2. Whether the ops have the same profile ratios in fp16/fp32 (maybe an op is particularly inefficient in fp16 and I should avoid it).

Essentially, grouping the shapes by dtype too could be very beneficial.

@stephenroller, I think this is a good idea, thanks for the suggestion! It should be one of the next follow-up tasks. There are several things that need to happen first. We need to standardize the way we represent reduced-precision ops; I don't know whether there is a consensus here and would need to check (cc @raghuramank100). In Caffe2, for example, some ops do dynamic dispatch while others are special ops with the type name hardcoded into the op name.

If we want this to be more generic, we could just use the set of input tensor dtypes as part of the key; then the dependency on operator naming goes away. But dtypes are not currently passed over to RecordFunction. It could be one more flag, along the lines of how @ilia-cher implemented passing shape information to RecordFunction. A rough illustration of such grouping is sketched below.
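A sketch of what dtype-aware grouping could look like on the Python side. This is purely illustrative: `input_shapes` is the field this diff adds to events, while `input_dtypes` is a hypothetical future field discussed in this thread and does not exist.

```python
from collections import defaultdict

def group_events(events):
    # Key on (op name, input shapes, input dtypes). `evt.input_shapes` is what
    # this diff records; `evt.input_dtypes` is a hypothetical future field.
    groups = defaultdict(list)
    for evt in events:
        key = (evt.name,
               str(evt.input_shapes),
               str(getattr(evt, "input_dtypes", None)))
        groups[key].append(evt)
    return groups
```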

@salexspb salexspb requested a review from zheng-xq May 2, 2019 20:09
@ilia-cher
Contributor

ilia-cher commented May 2, 2019

@salexspb we have not just the shapes but the whole tensors available in RecordFunction, so it should be trivial to save not only the shape but also the dtype and pass it back to Python?

@cpuhrsch
Contributor

cpuhrsch commented May 2, 2019

Another piece of feedback I've just gotten: it would be useful to reliably generate metrics like FLOPs and memory bandwidth as part of this. That is, a user would get a good FLOPs and memory-bandwidth estimate for each call. @salexspb

@salexspb
Contributor Author

salexspb commented May 2, 2019

@ilia-cher, you are right, my bad. It should be easy indeed. I'll have a look (not in this diff though).
@cpuhrsch, FLOPs and memory reads/writes are hard to measure directly. On CPU there are hardware counters, but getting those right is not easy, and validation is tricky. In Caffe2 we used analytical formulas; something similar would be useful to add to PyTorch as well. In Caffe2 it is part of the operator schema.

cc @dzhulgakov @smessmer about ways to integrate cost-inference functions into PyTorch. Has there been any discussion about this already?
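To make the analytical-formula idea concrete, here is a sketch of a per-op FLOP estimate computed from recorded input shapes. The formula and the helper are illustrative; nothing like this exists in the profiler as of this diff.

```python
def addmm_flops(input_shapes):
    # addmm(bias, mat1, mat2) records shapes like [[N], [M, K], [K, N], [], []].
    (m, k), (k2, n) = input_shapes[1], input_shapes[2]
    assert k == k2
    # One multiply and one add per inner-product term, plus the bias add.
    return 2 * m * k * n + m * n

# The first addmm row above, with shapes [[30], [128, 20], [20, 30], [], []],
# gives 2*128*20*30 + 128*30 = 157,440 FLOPs for that single call.
print(addmm_flops([[30], [128, 20], [20, 30], [], []]))
```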

@salexspb salexspb force-pushed the export-D15174987 branch from 733b702 to 284ae33 Compare May 3, 2019 00:01
Collaborator

@dzhulgakov dzhulgakov left a comment


This is great!

cc @dzhulgakov @smessmer about ways to integrate cost-inference functions into PyTorch. Has there been any discussion about this already?

We should probably add an extra field in c10 for new ops. For existing ones, the easiest would be to put it next to the operators in native_functions.yaml (and then we can codegen the c10 bits too).

@salexspb salexspb force-pushed the export-D15174987 branch from 284ae33 to 4f58687 Compare May 3, 2019 20:55
Contributor


Actually, I am not sure why we were printing to stdout before; it's not useful in the unit test. Can you change it to just call __repr__ on it for the sanity check?

Contributor Author


@soumith, is there a specific issue with printing to stdout? Normally a Python test runner would capture stdout, but when you run a test manually the output is useful. Is this not the case for some of the OSS CIs?

No strong feelings about this though :)

Contributor


The default test runner for Python doesn't capture stdout, but the default one in Buck eats it properly.
Anyway, this is a nit and it wasn't introduced in your diff.
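A minimal sketch of what the suggested change could look like in the test. The helper, the op being profiled, and the assertion are illustrative assumptions, not the actual test code; the point is to exercise the string formatting instead of printing.

```python
import torch

def check_profiler_repr(test_case):
    # Run a tiny op under the profiler and sanity-check the formatted table
    # instead of writing it to stdout.
    with torch.autograd.profiler.profile(record_shapes=True) as prof:
        torch.nn.functional.linear(torch.randn(128, 20), torch.randn(30, 20))
    output = prof.table()  # or repr(prof), as suggested above
    test_case.assertIn("Input Shapes", output)
```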

@salexspb salexspb force-pushed the export-D15174987 branch from 4f58687 to 68b29b5 Compare May 6, 2019 21:32
@salexspb salexspb force-pushed the export-D15174987 branch from 68b29b5 to ed4299c Compare May 6, 2019 21:32
Contributor

@zdevito zdevito left a comment


Code looks good to me.

Also, timing for non-nested (parent) events will become skewed.
I don't understand why this is true. If it is, can this be added to the documentation with an explanation?

@salexspb
Contributor Author

salexspb commented May 7, 2019

Chatted with @zdevito; the docs are probably fine as they are. The reason I warn more about parent functions is that when a function calls other functions, the overhead of shape collection inside the inner functions gets added to the self CPU time of the parent function, while the bottom-most function's self CPU time stays unaffected by the shape vector copy. For example, if a parent op calls ten children and each per-child shape copy costs, say, a few microseconds outside the child's timed interval, the parent's self CPU time is inflated by tens of microseconds while each child's self time is unchanged.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@salexspb is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Pull Request resolved: pytorch#20035

Differential Revision: D15174987

fbshipit-source-id: 3995f5e7fa891b755207e3024664d94204442896
@facebook-github-bot
Contributor

@salexspb merged this pull request in 864cfbc.
