[cuda] faster kernelTransformReduceInnermostDimIndex using cub #21295
Conversation
facebook-github-bot left a comment:
@akyrola has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed from 9a4d78b to 00c3710
facebook-github-bot left a comment:
@akyrola has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Pasting some prior discussion here:

@wesolwsk: If you can point me to a workflow using this operator, we can check the SM occupancy before and after. Incidentally, the macro for the number of threads per block is set much too high for PyTorch (1024). I previously found that for Caffe2, lowering it from 512 to 128 had a positive effect on performance for use cases with small- to medium-sized inputs, so it may be worth trying for PyTorch.

@akyrola: I don't have a workflow; this is just a random point optimization from a bootcamp task. But below I run nvprof to get "achieved occupancy", which I think is the same thing? Here is my benchmark script: Before: After: Interestingly, the occupancy is lower in the new version, but it runs 15x faster. Occupancy in isolation is not a good metric, because a very inefficient algorithm can have high occupancy. I tried with 128 threads, and the performance is 2x slower on my (admittedly too simple) benchmark. I'll keep it at 256 but introduce a define. Reading your post (very impressive result!), 128 would indeed be better for smaller inputs... I guess it would be best to vary the number based on the input size, but that's too much work for this case.

@wesolwsk: Which input sizes have you tested? Can you try a variety of input sizes (e.g. 100, 10000, 1000000)? For the last one, scaling the number of blocks with the number of rows may work better than limiting the number of blocks to 1024.
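The benchmark script and nvprof output referenced above are not preserved in this copy. A minimal sketch of the kind of timing script being discussed (a hypothetical stand-in, not @akyrola's original) might look like:

```python
import timeit
import torch

# Hypothetical shape: a few rows with a very long innermost dimension,
# as in the sparse-index use case discussed in this thread.
x = torch.randint(0, 1000, (8, 1_000_000), device="cuda")

def run():
    # min over the innermost dimension returns both values and indices,
    # which is what kernelTransformReduceInnermostDimIndex computes.
    torch.min(x, dim=-1)
    torch.cuda.synchronize()  # wait for the kernel before the timer stops

run()  # warm-up, so one-time CUDA initialization is not measured
print(min(timeit.repeat(run, number=10, repeat=5)), "s per 10 calls")
```

Running such a script under nvprof (for example with `nvprof --metrics achieved_occupancy python bench.py`) is one way to obtain the achieved-occupancy numbers quoted above.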
aten/src/THC/THCTensorMathReduce.cuh (outdated)
From @wesolwsk: can we assume anything about the size of the innermost dimension compared to the other dimensions? Optimal implementations will be different depending on which is larger.
aten/src/THC/THCTensorMathReduce.cuh (outdated)
From @wesolwsk: can you try without this limit? 1024 blocks yields occupancy of about 13 on Volta for short rows. Maybe you can get better performance for large input sizes if you just set the number of blocks to the row count. It would also allow you to get rid of the external for loop, which could help a little more.
test/test_torch.py (outdated)
Do you know how long this test takes to run? If it's more than a few seconds, we should mark it slowTest
It is fast, some tens of milliseconds.
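For reference, marking a test as slow in PyTorch's test suite is done with the slowTest decorator from the shared test utilities. A rough sketch (the import path has moved between versions, and the test body here is purely illustrative):

```python
import torch
# Illustrative import; the location of these helpers has changed across
# PyTorch versions (historically test/common_utils.py).
from torch.testing._internal.common_utils import TestCase, run_tests, slowTest

class TestInnermostDimReduce(TestCase):
    @slowTest  # only runs when slow tests are enabled (e.g. PYTORCH_TEST_WITH_SLOW=1)
    def test_min_large_innermost_dim(self):
        # Requires a CUDA device.
        x = torch.randint(0, 1000, (4, 1_000_000), device="cuda")
        values, indices = torch.min(x, dim=-1)
        # The value at the returned index must equal the reduced minimum.
        self.assertEqual(values, x.gather(-1, indices.unsqueeze(-1)).squeeze(-1))

if __name__ == "__main__":
    run_tests()
```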
I'm generally uneasy about improvements to legacy reductions. Ideally, all reduction operators should be ported to TensorIterator reductions, which achieve close to SOL (speed-of-light) bandwidth on a wide variety of parameters. max/min/max_index/min_index currently don't go through TensorIterator, but they should.
Indeed @wesolwsk, the speed depends a lot on the shape; the old version is faster if there are many rows. I was doing this in the context of sparse tensors, where the number of rows in the indices tensor is always very small (it is the number of tensor dimensions). But we should definitely develop an alternative kernel for the "skinny" tensors. Given what @ngimel says, though, perhaps I should drop this? We do, however, have a lot of CUB usage on the Caffe2 side.
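For context, a small illustration of the sparse-tensor path mentioned here (the shapes are made up): the indices tensor has one row per sparse dimension, so the reduced dimension count is tiny while the innermost dimension (the number of non-zeros) can be large.

```python
import torch

nnz = 500_000
# indices has shape (ndim, nnz): only 2 rows here, each very long.
indices = torch.randint(0, 1000, (2, nnz), device="cuda")
values = torch.rand(nnz, device="cuda")

# Constructing the COO tensor validates that per-row min/max indices are in
# bounds, which is where the innermost-dim min/max reduction gets called.
sp = torch.sparse_coo_tensor(indices, values, size=(1000, 1000))
sp = sp.coalesce()  # sorts and deduplicates indices (see PR #21214)
```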
@akyrola, nice perf comparison. Maybe we can keep both implementations for now but select based on input dimensions (it looks like the new one is better for rows of size > 1000). If TensorIterator reductions can handle both cases well, that would be even better.
Summary: When optimizing sparse tensor coalesce, I recognized that this kernel was taking the bulk of the time (see PR pytorch#21214). It is used (at least) in the sparse tensor constructor to validate that the index tensor's min/max indices are valid. This PR rewrites the kernel using a CUB reduction, achieving about a 16x speedup. With my benchmark for coalesce, nvprof showed before:

```
GPU activities: 45.47%  2.42669s  101  24.027ms  23.862ms  28.968ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                45.41%  2.42386s  101  23.999ms  23.857ms  28.944ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```

... and after:

```
GPU activities: 19.50%  154.92ms  101  1.5338ms  1.5285ms  1.5987ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                19.45%  154.52ms  101  1.5299ms  1.5247ms  1.5933ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```

Pull Request resolved: pytorch#21295
Differential Revision: D15606873
fbshipit-source-id: e5bc86933efa44c36c3b2942114a04c20abd7700
Force-pushed from 00c3710 to 1c8c84b
TensorIterator sum reduction achieves higher bandwidth everywhere than the current min/max reduction (and, looking at the speed-up numbers, even than the improved min reduction). min/max numbers should be similar to sum if min/max are moved to use TensorIterator. Benchmarking script and output on V100:
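The benchmarking script and V100 output mentioned here are not included in this copy. A rough sketch of the kind of bandwidth comparison being described (hypothetical sizes and helper, not the original script):

```python
import time
import torch

def reduction_gbps(fn, x, iters=50):
    """Approximate achieved read bandwidth of a row-wise reduction, in GB/s."""
    fn(x)
    torch.cuda.synchronize()  # warm-up and flush pending work before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    return x.numel() * x.element_size() / elapsed / 1e9

x = torch.randn(1000, 100_000, device="cuda")
print("sum:", reduction_gbps(lambda t: torch.sum(t, dim=-1), x))
print("min:", reduction_gbps(lambda t: torch.min(t, dim=-1), x))
```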
@ngimel, the current min/max also returns the index (i.e. it is both min and argmin); are these kinds of reductions already supported by TensorIterator?
By the way, the benchmark script you pasted might be susceptible to startup overheads? The first time a kernel is run, it takes much longer. I was using the timeit module to avoid this, by taking the minimum over repeated runs. My tests were done on an old M40.
Yes, it's pretty flexible; you just have to provide the necessary functors (e.g. I'm pretty sure it currently returns both mean and std from a single reduction pass, using the right reduction functor).
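From the Python API side, the fused-output reductions being discussed look like this (illustrative only; the fusion itself happens inside the reduction kernels, and torch.std_mean availability depends on the PyTorch version):

```python
import torch

x = torch.randn(64, 4096, device="cuda")

# min/argmin from one call: a single reduction produces both outputs.
values, indices = torch.min(x, dim=-1)

# std and mean computed together; torch.std_mean returns (std, mean).
std, mean = torch.std_mean(x, dim=-1)
```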
OK, let me work on the TensorIterator approach, perhaps later this week. I can't promise, though, so if someone wants to do it instead, please do.
Hi @akyrola! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations; afterwards, the pull request will be tagged with the corresponding label.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as

When optimizing sparse tensor coalesce, I recognized that this kernel was taking the bulk of the time (see PR #21214). It is used (at least) in the sparse tensor constructor to validate that the index tensor's min/max indices are valid.
This PR rewrites the kernel using a CUB reduction, achieving about a 16x speedup; see the nvprof before/after output in the summary above.
Test: test/test_torch.py and test/test_sparse.py pass.