Conversation

@salexspb
Contributor

@salexspb salexspb commented May 1, 2019

Summary:
This is useful when you would like to understand the performance
bottlenecks of your model. One can use the shape analysis to fit the
model to a roofline model of the hardware.

Please note that this feature can potentially skew profiling results;
in particular, timing for non-nested (parent) events will become skewed,
so one should only rely on the timing of the bottom-most events when
shape analysis is enabled. When shapes are not needed, profiling is
unaffected: shapes are not collected by default, and this diff does not
change that.

One of the next steps could be, for example, choosing the best
candidates for quantization. In the scope of this diff I am just adding
optional shape collection to the Event class, plus minor functionality
on the Python side for grouping by shapes.

In the output tables shapes are truncated, but for grouping the full
shape string is used as the key.
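For reference, a minimal usage sketch. The flag names below (`record_shapes` on the profiler and `group_by_input_shape` on `key_averages`) are assumptions about how the feature is exposed on the Python side, not text quoted from this diff:

```python
import torch
from torch.autograd import profiler

model = torch.nn.Sequential(torch.nn.Linear(20, 30), torch.nn.Linear(30, 40))
x = torch.randn(128, 20)

# Shape collection is opt-in; the default (no shapes) leaves profiling unaffected.
with profiler.profile(record_shapes=True) as prof:
    model(x)

# Group events that share both the operator name and the full input-shape string.
print(prof.key_averages(group_by_input_shape=True).table())
```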

Here is an example output:

test_profiler_shapes (test_autograd.TestAutograd) ...

------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
unsigned short      2.30%            305.031us        2.30%            305.031us        305.031us        NaN              0.000us          0.000us          1                [[30, 20]]
addmm               69.40%           9.199ms          69.40%           9.199ms          9.199ms          NaN              0.000us          0.000us          1                [[30], [128, 20], [20, 30], [], []]
unsigned short      0.98%            129.326us        0.98%            129.326us        129.326us        NaN              0.000us          0.000us          1                [[40, 30]]
addmm               27.32%           3.621ms          27.32%           3.621ms          3.621ms          NaN              0.000us          0.000us          1                [[40], [128, 30], [30, 40], [], []]
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us

------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
unsigned short      2.30%            305.031us        2.30%            305.031us        305.031us        NaN              0.000us          0.000us          1                [[30, 20]]
addmm               69.40%           9.199ms          69.40%           9.199ms          9.199ms          NaN              0.000us          0.000us          1                [[30], [128, 20], [20, 30], [], []]
unsigned short      0.98%            129.326us        0.98%            129.326us        129.326us        NaN              0.000us          0.000us          1                [[40, 30]]
addmm               27.32%           3.621ms          27.32%           3.621ms          3.621ms          NaN              0.000us          0.000us          1                [[40], [128, 30], [30, 40], [], []]
------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us

Also added this for older aggregation test:

test_profiler_aggregation_lstm (test_autograd.TestAutograd) ...
======================================================================================================================================================================================================
TEST
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
lstm                     0.69%            4.606ms          5.30%            35.507ms         35.507ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.67%            4.521ms          5.27%            35.340ms         35.340ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.66%            4.399ms          5.02%            33.638ms         33.638ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.65%            4.354ms          4.92%            32.958ms         32.958ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.65%            4.351ms          4.96%            33.241ms         33.241ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.65%            4.323ms          5.10%            34.163ms         34.163ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.304ms          4.92%            32.938ms         32.938ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.300ms          5.10%            34.172ms         34.172ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.292ms          5.05%            33.828ms         33.828ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
lstm                     0.64%            4.263ms          4.98%            33.357ms         33.357ms         NaN              0.000us          0.000us          1                [[5, 3, 10]]
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
sigmoid                  15.32%           102.647ms        15.32%           102.647ms        171.078us        NaN              0.000us          0.000us          600              [[3, 20]]
mul                      15.20%           101.854ms        15.20%           101.854ms        169.757us        NaN              0.000us          0.000us          600              [[3, 20], [3, 20]]
lstm                     12.74%           85.355ms         100.00%          670.120ms        33.506ms         NaN              0.000us          0.000us          20               [[5, 3, 10]]
addmm                    11.16%           74.808ms         11.16%           74.808ms         249.361us        NaN              0.000us          0.000us          300              [[80], [3, 20], [20, 80], [], []]
tanh                     9.89%            66.247ms         9.89%            66.247ms         165.617us        NaN              0.000us          0.000us          400              [[3, 20]]
split                    6.42%            43.019ms         6.42%            43.019ms         215.095us        NaN              0.000us          0.000us          200              [[3, 80]]
add                      5.67%            38.020ms         5.67%            38.020ms         190.101us        NaN              0.000us          0.000us          200              [[3, 80], [3, 80], []]
add                      4.81%            32.225ms         4.81%            32.225ms         161.124us        NaN              0.000us          0.000us          200              [[3, 20], [3, 20], []]
addmm                    3.79%            25.380ms         3.79%            25.380ms         253.796us        NaN              0.000us          0.000us          100              [[80], [3, 10], [10, 80], [], []]
unsigned short           3.72%            24.925ms         3.72%            24.925ms         83.083us         NaN              0.000us          0.000us          300              [[80, 20]]
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us

Total time based on python measurements:  691.366ms
CPU time measurement python side overhead: 3.17%
ok

Differential Revision: D15174987

@pytorchbot pytorchbot added the module: autograd Related to torch.autograd, and the autograd engine in general label May 1, 2019
@salexspb salexspb force-pushed the export-D15174987 branch from 78c22bf to 3507f78 Compare May 1, 2019 22:57
@ilia-cher
Contributor

@cpuhrsch Christian, I think this will be helpful for that use case of debugging shapes you mentioned.

@salexspb salexspb force-pushed the export-D15174987 branch from 3507f78 to 733b702 Compare May 2, 2019 17:22
@salexspb salexspb requested review from ilia-cher and zdevito May 2, 2019 17:24
@cpuhrsch
Contributor

cpuhrsch commented May 2, 2019

Awesome!

@stephenroller
Contributor

Just checking: does this make the fp16/fp32 distinction? It might be useful for making sure you're not accidentally slipping into fp32 ops at some point.

@stephenroller
Contributor

@salexspb if it's easy to support an fp16/fp32 distinction it could be really useful for identifying:

  1. If I'm converting a model to fp16, whether I successfully found every place where I forgot to use .as_type(dtype) instead of .float(), or where I did some ops in fp32 temporarily but forgot to convert them back.
  2. Whether the ops have the same profile ratios in fp16/fp32 (maybe an op is particularly inefficient in fp16 and I should avoid it).

Essentially, grouping the shapes by dtype too could be very beneficial.

@salexspb
Contributor Author

salexspb commented May 2, 2019

@salexspb if it's easy to support an fp16/fp32 distinction it could be really useful for identifying:

  1. If I'm converting a model to fp16, whether I successfully found every place where I forgot to use .as_type(dtype) instead of .float(), or where I did some ops in fp32 temporarily but forgot to convert them back.
  2. Whether the ops have the same profile ratios in fp16/fp32 (maybe an op is particularly inefficient in fp16 and I should avoid it).

Essentially, grouping the shapes by dtype too could be very beneficial.

@stephenroller, I think this is a good idea, thanks for the suggestion! It should be one of the next follow-up tasks. There are several things that need to happen first. We need to standardize the way we represent reduced-precision ops; I don't know whether there is a consensus here and would need to check (cc @raghuramank100). In Caffe2, for example, some ops do dynamic dispatch while others are special ops with the type name hardcoded into the op name.

If we want this to be more generic, we could just use the set of input tensor dtypes as part of the key; then the dependency on operator naming goes away. But dtypes are not currently passed over to RecordFunction. It could be one more flag, along the lines of how @ilia-cher implemented passing shape information to RecordFunction. A rough illustration of such grouping is sketched below.
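A sketch of what dtype-aware grouping could look like on the Python side. This is purely illustrative: `input_shapes` is the field this diff adds to events, while `input_dtypes` is a hypothetical future field discussed in this thread and does not exist.

```python
from collections import defaultdict

def group_events(events):
    # Key on (op name, input shapes, input dtypes). `evt.input_shapes` is what
    # this diff records; `evt.input_dtypes` is a hypothetical future field.
    groups = defaultdict(list)
    for evt in events:
        key = (evt.name,
               str(evt.input_shapes),
               str(getattr(evt, "input_dtypes", None)))
        groups[key].append(evt)
    return groups
```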

@salexspb salexspb requested a review from zheng-xq May 2, 2019 20:09
@ilia-cher
Contributor

ilia-cher commented May 2, 2019

@salexspb we have not just the shapes but the whole tensors available in RecordFunction, so it should be trivial to save not only the shape but also the dtype and pass it back to Python?

@cpuhrsch
Contributor

cpuhrsch commented May 2, 2019

Another piece of feedback I've just gotten: it would be useful to reliably generate metrics like FLOPs and memory bandwidth as part of this. That is, a user would get a good FLOPs and memory-bandwidth estimate for each call. @salexspb

@salexspb
Contributor Author

salexspb commented May 2, 2019

@ilia-cher, you are right, my bad. It should be easy indeed. I'll have a look (not in this diff though).
@cpuhrsch, FLOPs and memory reads/writes are hard to measure directly. On CPU there are hardware counters, but getting those right is not easy, and validation is tricky. In Caffe2 we used analytical formulas; something similar would be useful to add to PyTorch as well. In Caffe2 it is part of the operator schema.

cc @dzhulgakov @smessmer about ways to integrate cost-inference functions into PyTorch. Has there been any discussion about this already?
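To make the analytical-formula idea concrete, here is a sketch of a per-op FLOP estimate computed from recorded input shapes. The formula and the helper are illustrative; nothing like this exists in the profiler as of this diff.

```python
def addmm_flops(input_shapes):
    # addmm(bias, mat1, mat2) records shapes like [[N], [M, K], [K, N], [], []].
    (m, k), (k2, n) = input_shapes[1], input_shapes[2]
    assert k == k2
    # One multiply and one add per inner-product term, plus the bias add.
    return 2 * m * k * n + m * n

# The first addmm row above, with shapes [[30], [128, 20], [20, 30], [], []],
# gives 2*128*20*30 + 128*30 = 157,440 FLOPs for that single call.
print(addmm_flops([[30], [128, 20], [20, 30], [], []]))
```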

@salexspb salexspb force-pushed the export-D15174987 branch from 733b702 to 284ae33 Compare May 3, 2019 00:01
Collaborator

@dzhulgakov dzhulgakov left a comment


This is great!

cc @dzhulgakov @smessmer about ways to integrate cost-inference functions into PyTorch. Has there been any discussion about this already?

We should probably add an extra field in c10 for new ops. For existing ones, the easiest would be to put it next to the operators in native_functions.yaml (and then we can codegen the c10 bits too).

@salexspb salexspb force-pushed the export-D15174987 branch from 284ae33 to 4f58687 Compare May 3, 2019 20:55
Contributor


Actually, I am not sure why we were printing to stdout before; it's not useful in the unit test. Can you change it to just call __repr__ on it for the sanity check?

Contributor Author


@soumith, is there a specific issue with printing to stdout? Normally a Python test runner would capture stdout, but when you run a test manually the output is useful. Is this not the case for some of the OSS CIs?

No strong feelings about this though :)

Contributor


The default test runner for Python doesn't capture stdout, but the default one in Buck eats it properly.
Anyway, this is a nit and it wasn't introduced in your diff.
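A minimal sketch of what the suggested change could look like in the test. The helper, the op being profiled, and the assertion are illustrative assumptions, not the actual test code; the point is to exercise the string formatting instead of printing.

```python
import torch

def check_profiler_repr(test_case):
    # Run a tiny op under the profiler and sanity-check the formatted table
    # instead of writing it to stdout.
    with torch.autograd.profiler.profile(record_shapes=True) as prof:
        torch.nn.functional.linear(torch.randn(128, 20), torch.randn(30, 20))
    output = prof.table()  # or repr(prof), as suggested above
    test_case.assertIn("Input Shapes", output)
```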

@salexspb salexspb force-pushed the export-D15174987 branch from 4f58687 to 68b29b5 Compare May 6, 2019 21:32
@salexspb salexspb force-pushed the export-D15174987 branch from 68b29b5 to ed4299c Compare May 6, 2019 21:32
Contributor

@zdevito zdevito left a comment


Code looks good to me.

Also, timing for non-nested (parent) events will become skewed.
I don't understand why this is true. If it is, can this be added to the documentation with an explanation?

@salexspb
Contributor Author

salexspb commented May 7, 2019

Chatted with @zdevito; the docs are probably fine as they are. The reason I warn more about parent functions is that when a function calls other functions, the overhead of shape collection inside the inner functions gets added to the self CPU time of the parent function, while the bottom-most function's self CPU time stays unaffected by the shape vector copy. For example, if a parent op calls ten children and each per-child shape copy costs, say, a few microseconds outside the child's timed interval, the parent's self CPU time is inflated by tens of microseconds while each child's self time is unchanged.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@salexspb is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Pull Request resolved: pytorch#20035

Differential Revision: D15174987

fbshipit-source-id: 3995f5e7fa891b755207e3024664d94204442896
@facebook-github-bot
Contributor

@salexspb merged this pull request in 864cfbc.
