PyTorch Profiler Shape aggregation support #20035
Conversation
@cpuhrsch Christian, I think this will be helpful for that use case of debugging shapes you mentioned.
Awesome!
Just checking: does this make the fp16/fp32 distinction? It might be useful for making sure you're not accidentally slipping into fp32 ops at some point.
@salexspb if it's easy to support an fp16/fp32 distinction, it could be really useful for identifying, for example, ops that are accidentally running in fp32 (as mentioned above).
Essentially, grouping the shapes by dtype too would potentially be very beneficial.
@stephenroller, I think this is a good idea, thanks for the suggestion! It should be one of the next follow-up tasks. There are several things that need to happen. We need to standardize the way we represent reduced-precision ops; I don't know if there is a consensus here and would need to check (cc @raghuramank100). In Caffe2, for example, some ops do dynamic dispatch while others are special ops with the type name hardcoded into their name. If we want this to be more generic, we could just use the set of tensor data types as a key; then the dependency on operator naming is gone. But that information is not currently passed over to RecordFunction. It could be one more flag, along the lines of how @ilia-cher implemented shape-information passing to RecordFunction.
@salexspb we have not just the shapes but the whole tensors available in RecordFunction, so it should be trivial to save not only the shape but also the dtype and pass it back to Python?
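If the dtype did get passed through alongside the shapes, the Python-side aggregation could simply key on both. A minimal sketch of that grouping, assuming a hypothetical `input_dtypes` field next to the `input_shapes` this PR records (no such field exists yet):

```python
from collections import defaultdict

def group_by_shape_and_dtype(events):
    # Aggregate profiler events by (op name, input shapes, input dtypes).
    # `input_shapes` is recorded by this PR; `input_dtypes` is hypothetical
    # and stands in for the follow-up discussed above.
    groups = defaultdict(lambda: {"count": 0, "self_cpu_us": 0.0})
    for evt in events:
        key = (evt.name, str(evt.input_shapes),
               str(getattr(evt, "input_dtypes", None)))
        groups[key]["count"] += 1
        groups[key]["self_cpu_us"] += evt.self_cpu_time_total
    return groups
```

With such a key, an fp16 and an fp32 `addmm` over the same shapes would land in separate rows, making accidental fp32 ops stand out.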
Another piece of feedback I've just gotten is the ability to reliably generate metrics like FLOPs and memory bandwidth as part of this. That is, a user would get a good FLOPs and memory-bandwidth estimate for each call. @salexspb
@ilia-cher, you are right, my bad. Should be easy indeed. I'll have a look (not in this diff though). cc @dzhulgakov @smessmer about ways to integrate cost inference functions into PyTorch. Was there any discussion about it already?
dzhulgakov left a comment
This is great!
> cc @dzhulgakov @smessmer about ways to integrate cost inference functions into PyTorch. Was there any discussion about it already?
We should probably add an extra field in c10 for new ops. For existing ones, the easiest would be to put it next to the operators in native_functions.yaml (and then we can codegen the c10 bits too).
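The arithmetic such a cost inference function would encode is already possible by hand from the shapes this PR records. A hedged sketch for `addmm`, whose argument order (bias, mat1, mat2, beta, alpha) matches the shape lists in the example output below (the helper is illustrative, not a proposed API):

```python
def addmm_flops(input_shapes):
    # addmm(bias, mat1, mat2, beta, alpha) computes beta*bias + alpha*(mat1 @ mat2);
    # for mat1 of shape [M, K] and mat2 of shape [K, N] the matmul alone
    # costs roughly 2*M*K*N floating point operations.
    (m, k), (k2, n) = input_shapes[1], input_shapes[2]
    assert k == k2, "inner dimensions must match"
    return 2 * m * k * n

# Shapes as recorded for one addmm call in the example output below:
print(addmm_flops([[30], [128, 20], [20, 30], [], []]))  # 153600
```

Dividing such an estimate by the measured CPU time then gives the achieved FLOP rate for a roofline comparison.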
test/test_autograd.py
Outdated
Actually, I am not sure why we were printing to stdout before; it's not useful in the unit test. Can you change it to just call __repr__ on it as a sanity check?
@soumith, is there a specific issue with printing to stdout? Normally a Python test runner captures stdout, but when you run a test manually the output is useful. Is this not the case for some of the OSS CIs?
No strong feelings about this though :)
The default test runner for Python doesn't capture stdout, but the default one in Buck eats it properly.
Anyway, this is a nit and it wasn't introduced in your diff.
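For reference, a minimal sketch of what the suggested change could look like in test_autograd.py (test name and assertion are illustrative):

```python
def test_profiler_output(self):
    x = torch.randn(10, 10)
    with torch.autograd.profiler.profile() as prof:
        torch.mm(x, x)
    # Call __repr__ as a sanity check instead of printing to stdout.
    self.assertIsNotNone(repr(prof))
```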
zdevito left a comment
Code looks good to me.
> Also timing for not nested events will become wrong.
I don't understand why this is true. If it is, can this be added to the documentation with an explanation?
Chatted with @zdevito; the docs are probably fine as they are. The reason I warn more about top-level functions is that when a function calls other functions, the overhead of shape collection inside the inner functions gets added to the self CPU time of the parent function, while the bottom-most function's self CPU time stays unaffected by the shape-vector copy.
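To make the caveat concrete, a short usage sketch (assuming the `record_shapes` flag and `group_by_input_shape` grouping this PR introduces; exact argument names may differ):

```python
import torch
from torch.autograd import profiler

rnn = torch.nn.LSTM(10, 20)
x = torch.randn(5, 3, 10)

# Shape collection is opt-in; with the default record_shapes=False,
# profiling overhead is unchanged by this diff.
with profiler.profile(record_shapes=True) as prof:
    rnn(x)

# Because shape collection inside nested ops is attributed to the parent's
# self CPU time, trust the self CPU times of the bottom-most (leaf) ops
# rather than of outer calls like `lstm`.
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="self_cpu_time_total"))
```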
facebook-github-bot left a comment
@salexspb is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
This is useful when you would like to understand the performance bottlenecks of your model. One can use the shape analysis to fit the model to a roofline model of their hardware.

Please note that this feature can potentially skew profiling results: timing for non-leaf events (events with nested children) will become inaccurate, so one should only rely on the timing of the bottom-most events when shape analysis is used. For the case where people don't need shapes, profiling is not affected, since we then don't collect them; that is the default behavior and this diff doesn't change it.

One of the next steps could be, for example, choosing the best candidates for quantization. In the scope of this diff I am just adding optional shape collection to the Event class, plus minor functionality on the Python side for grouping by shapes. In the output tables shapes are truncated, but for grouping the full shape string is used as the key.

Here is an example output:

```
test_profiler_shapes (test_autograd.TestAutograd) ...
------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------------------------------
Name                Self CPU total %  Self CPU total   CPU total %   CPU total   CPU time avg   CUDA total %   CUDA total   CUDA time avg   Number of Calls  Input Shapes
------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------------------------------
unsigned short      2.30%             305.031us        2.30%         305.031us   305.031us      NaN            0.000us      0.000us         1                [[30, 20]]
addmm               69.40%            9.199ms          69.40%        9.199ms     9.199ms        NaN            0.000us      0.000us         1                [[30], [128, 20], [20, 30], [], []]
unsigned short      0.98%             129.326us        0.98%         129.326us   129.326us      NaN            0.000us      0.000us         1                [[40, 30]]
addmm               27.32%            3.621ms          27.32%        3.621ms     3.621ms        NaN            0.000us      0.000us         1                [[40], [128, 30], [30, 40], [], []]
------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ------------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us
```

Also added this for the older aggregation test:

```
test_profiler_aggregation_lstm (test_autograd.TestAutograd) ...
================================================================ TEST
-----------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ----------------------------------
Name                     Self CPU total %  Self CPU total   CPU total %   CPU total   CPU time avg   CUDA total %   CUDA total   CUDA time avg   Number of Calls  Input Shapes
-----------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ----------------------------------
lstm                     0.69%             4.606ms          5.30%         35.507ms    35.507ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.67%             4.521ms          5.27%         35.340ms    35.340ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.66%             4.399ms          5.02%         33.638ms    33.638ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.65%             4.354ms          4.92%         32.958ms    32.958ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.65%             4.351ms          4.96%         33.241ms    33.241ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.65%             4.323ms          5.10%         34.163ms    34.163ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.64%             4.304ms          4.92%         32.938ms    32.938ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.64%             4.300ms          5.10%         34.172ms    34.172ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.64%             4.292ms          5.05%         33.828ms    33.828ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
lstm                     0.64%             4.263ms          4.98%         33.357ms    33.357ms       NaN            0.000us      0.000us         1                [[5, 3, 10]]
-----------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us

-----------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ----------------------------------
Name                     Self CPU total %  Self CPU total   CPU total %   CPU total   CPU time avg   CUDA total %   CUDA total   CUDA time avg   Number of Calls  Input Shapes
-----------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ----------------------------------
sigmoid                  15.32%            102.647ms        15.32%        102.647ms   171.078us      NaN            0.000us      0.000us         600              [[3, 20]]
mul                      15.20%            101.854ms        15.20%        101.854ms   169.757us      NaN            0.000us      0.000us         600              [[3, 20], [3, 20]]
lstm                     12.74%            85.355ms         100.00%       670.120ms   33.506ms       NaN            0.000us      0.000us         20               [[5, 3, 10]]
addmm                    11.16%            74.808ms         11.16%        74.808ms    249.361us      NaN            0.000us      0.000us         300              [[80], [3, 20], [20, 80], [], []]
tanh                     9.89%             66.247ms         9.89%         66.247ms    165.617us      NaN            0.000us      0.000us         400              [[3, 20]]
split                    6.42%             43.019ms         6.42%         43.019ms    215.095us      NaN            0.000us      0.000us         200              [[3, 80]]
add                      5.67%             38.020ms         5.67%         38.020ms    190.101us      NaN            0.000us      0.000us         200              [[3, 80], [3, 80], []]
add                      4.81%             32.225ms         4.81%         32.225ms    161.124us      NaN            0.000us      0.000us         200              [[3, 20], [3, 20], []]
addmm                    3.79%             25.380ms         3.79%         25.380ms    253.796us      NaN            0.000us      0.000us         100              [[80], [3, 10], [10, 80], [], []]
unsigned short           3.72%             24.925ms         3.72%         24.925ms    83.083us       NaN            0.000us      0.000us         300              [[80, 20]]
-----------------------  ----------------  ---------------  ------------  ----------  -------------  -------------  -----------  --------------  ---------------  ----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us

Total time based on python measurements: 691.366ms
CPU time measurement python side overhead: 3.17%
ok
```

Pull Request resolved: pytorch#20035
Differential Revision: D15174987
fbshipit-source-id: 3995f5e7fa891b755207e3024664d94204442896