[sparse] use _sparse_coo_tensor_unsafe in coalesce for speedup #21214
ssnl
left a comment
Thanks a lot! This is the correct fix.
@ezyang If the cmake change is done at the correct place, we should merge this.
bddppq
left a comment
LGTM!
facebook-github-bot
left a comment
@akyrola has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Studied why sparse tensor coalesce was slow (issue #10757). Using nvprof and a simple benchmark, I determined that the bulk of the time was spent in ``kernelTransformReduceInnermostDimIndex``, which is called when a sparse tensor is constructed with sparse_coo_tensor and the constructor sanity-checks the minimum and maximum indices. We do not need this sanity check in coalesce, because coalescing the tensor does not change these min/max values. On my benchmark with 1 million non-zeros, the runtime of coalesce improved by about 10x, from 0.52 sec to 0.005 sec.
Pull Request resolved: pytorch/pytorch#21214
Reviewed By: bddppq
Differential Revision: D15584338
Pulled By: akyrola
fbshipit-source-id: a08378baa018dbd0b45d7aba661fc9aefd3791e0
Summary: When optimizing sparse tensor coalesce, I recognized that this kernel was taking the bulk of the time (see PR pytorch#21214). It is used (at least) in the sparse tensor constructor to validate that the min/max indices of the index tensor are valid. This PR rewrites the kernel using CUB reduction, achieving about a 16x speedup. With my benchmark for coalesce, nvprof showed this before the change:
```
# GPU activities: 45.47% 2.42669s 101 24.027ms 23.862ms 28.968ms void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
# 45.41% 2.42386s 101 23.999ms 23.857ms 28.944ms void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```
... and this after it:
```
GPU activities: 19.50% 154.92ms 101 1.5338ms 1.5285ms 1.5987ms void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
19.45% 154.52ms 101 1.5299ms 1.5247ms 1.5933ms void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```
Pull Request resolved: pytorch#21295
Differential Revision: D15606873
fbshipit-source-id: e5bc86933efa44c36c3b2942114a04c20abd7700
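For readers unfamiliar with the check this kernel serves, here is a minimal plain-Python sketch of a min/max index validation, assuming the index data is laid out with one row per sparse dimension and one column per stored non-zero. The names `index_bounds` and `check_indices` are hypothetical and are not PyTorch functions; the real check runs as a GPU reduction rather than a Python loop.

```python
from typing import List, Sequence, Tuple


def index_bounds(indices: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Return (min, max) of the stored indices for each sparse dimension.

    Hypothetical helper: `indices` has one row per sparse dimension and one
    column per stored non-zero. Each row must be non-empty.
    """
    return [(min(row), max(row)) for row in indices]


def check_indices(indices: Sequence[Sequence[int]], sizes: Sequence[int]) -> None:
    """Raise ValueError if any index lies outside [0, size) in its dimension."""
    if len(indices) != len(sizes):
        raise ValueError("expected one index row per sparse dimension")
    if not indices or len(indices[0]) == 0:
        return  # no stored non-zeros, nothing to validate
    for dim, ((lo, hi), size) in enumerate(zip(index_bounds(indices), sizes)):
        if lo < 0:
            raise ValueError(f"dim {dim}: found negative index {lo}")
        if hi >= size:
            raise ValueError(f"dim {dim}: index {hi} out of range for size {size}")


# Three non-zeros in a 3 x 4 tensor, at positions (0, 3), (2, 0) and (1, 3).
example_indices = [[0, 2, 1], [3, 0, 3]]
check_indices(example_indices, sizes=(3, 4))  # passes silently
```

The cost of the check is two full reductions (a min and a max) over every stored index, which is why it shows up so prominently in a profile when the constructor is called repeatedly.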
Studied why sparse tensor coalesce was slow: issue #10757.
Using nvprof and a simple benchmark, I determined that the bulk of the time was spent in kernelTransformReduceInnermostDimIndex, which is called when a sparse tensor is constructed with sparse_coo_tensor and the constructor sanity-checks the minimum and maximum indices. We do not need this sanity check in coalesce, because coalescing the tensor does not change these min/max values, so the result can be built with _sparse_coo_tensor_unsafe instead. On my benchmark with 1 million non-zeros, the runtime of coalesce improved by about 10x, from 0.52 sec to 0.005 sec.
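The claim that coalescing leaves the index min/max untouched can be illustrated with a small plain-Python model. This is only a sketch under the assumption that coalescing sums the values of duplicate index tuples and sorts the result; `coalesce` and `bounds` below are hypothetical helpers, not the ATen implementation.

```python
from collections import defaultdict
from typing import List, Tuple

Entry = Tuple[Tuple[int, ...], float]


def coalesce(entries: List[Entry]) -> List[Entry]:
    """Sum values that share an index tuple; return entries sorted by index."""
    totals = defaultdict(float)
    for index, value in entries:
        totals[index] += value
    return sorted(totals.items())


def bounds(entries: List[Entry]) -> List[Tuple[int, int]]:
    """Per-dimension (min, max) over all index tuples."""
    columns = zip(*(index for index, _ in entries))
    return [(min(col), max(col)) for col in columns]


# Uncoalesced data: the index (0, 3) is stored twice.
raw = [((0, 3), 1.0), ((2, 0), 2.0), ((0, 3), 4.0), ((1, 3), 0.5)]
merged = coalesce(raw)

# Coalescing only merges duplicates, so the set of distinct index tuples is
# unchanged and the per-dimension min/max cannot move.
assert bounds(raw) == bounds(merged)
```

Because the set of distinct indices is the same before and after, indices that passed validation when the input tensor was built are still valid in the output, and re-running the check adds cost without adding safety.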
Test: test/test_sparse.py.
I also fixed the CMakeLists.txt, which did not include the .cuh files in the native/sparse/cuda directory.
We should also look into why tensor min/max is slow.