
Conversation

@akyrola

@akyrola akyrola commented May 31, 2019

Studied why sparse tensor coalesce was slow: issue #10757.

Using nvprof and a simple benchmark, I determined that the bulk of the time was spent in `kernelTransformReduceInnermostDimIndex`, which is called when a sparse tensor is constructed with sparse_coo_tensor to sanity-check the minimum and maximum indices. However, this sanity check is unnecessary here, because coalescing the tensor cannot change those min/max values.
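For context, coalescing merges duplicate entries of a COO tensor by summing their values. A minimal pure-Python sketch (illustrative only, not the actual PyTorch implementation) shows why the per-dimension min/max of the indices cannot change:

```python
from collections import defaultdict

def coalesce(indices, values):
    """Merge duplicate COO entries by summing their values.

    indices: list of (row, col) tuples; values: matching list of floats.
    Merging duplicates never introduces a new index tuple, so the
    per-dimension min/max of the indices is the same before and after.
    """
    acc = defaultdict(float)
    for idx, v in zip(indices, values):
        acc[idx] += v
    merged = sorted(acc.items())  # sort the indices, as coalesce() does
    return [i for i, _ in merged], [v for _, v in merged]

idx = [(0, 1), (2, 3), (0, 1)]   # one duplicate entry
val = [1.0, 2.0, 3.0]
ci, cv = coalesce(idx, val)
assert ci == [(0, 1), (2, 3)] and cv == [4.0, 2.0]
# min/max are unchanged, so re-validating them after coalesce is redundant
assert min(ci) == min(idx) and max(ci) == max(idx)
```

Because coalescing only merges duplicates, any bounds check that passed when the original tensor was constructed still holds for the coalesced result.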

On my benchmark with 1 million non-zeros, the runtime of coalesce improved about 100x, from 0.52 s to 0.005 s.

Test: test/test_sparse.py.

I also fixed CMakeLists.txt, which did not include the .cuh files in the native/sparse/cuda directory.

```
==63940== Profiling application: python coalesce_bench.py
==63940== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   45.47%  2.42669s       101  24.027ms  23.862ms  28.968ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                   45.41%  2.42386s       101  23.999ms  23.857ms  28.944ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                    2.84%  151.49ms       100  1.5149ms  1.5082ms  1.6950ms  void at::native::apply::coalesceValuesKernel<float, float>(long*, at::native::apply::coalesceValuesKernel<float, float>, float*, float, long, long, long)
                    2.65%  141.65ms       900  157.39us  153.06us  162.31us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__merge_sort::MergeAgent<thrust::device_ptr<long>, thrust::device_ptr<long>, long, ThrustLTOp<long, bool=0>, thrust::detail::integral_constant<bool, bool=1>>, bool, thrust::device_ptr<long>, thrust::device_ptr<long>, long, long*, long*, ThrustLTOp<long, bool=0>, long*, long>(thrust::device_ptr<long>, thrust::device_ptr<long>, long, long, bool=0, ThrustLTOp<long, bool=0>, bool, bool=1, thrust::detail::integral_constant<bool, bool=1>)
```

We should also look into why tensor min/max is slow.

Collaborator

@ssnl ssnl left a comment


Thanks a lot! This is the correct fix.

@ezyang If the cmake change is done at the correct place, we should merge this.

Contributor

@bddppq bddppq left a comment


LGTM!

@akyrola force-pushed the akyrola/coalesce branch from 6a9f57a to 0c37a2e on May 31, 2019 19:43
@pytorchbot added the labels module: build (Build system issues), module: cuda (Related to torch.cuda, and CUDA support in general), module: internals (Related to internal abstractions in c10 and ATen), module: operators, and module: sparse (Related to torch.sparse) on May 31, 2019
@akyrola force-pushed the akyrola/coalesce branch from 0c37a2e to c79fb30 on May 31, 2019 20:50
Contributor

@facebook-github-bot facebook-github-bot left a comment


@akyrola has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 1, 2019
Summary:
Studied why sparse tensor coalesce was slow:  issue #10757.

Using nvprof and a simple benchmark, I determined that the bulk of the time was spent in ``kernelTransformReduceInnermostDimIndex``, which is called when a sparse tensor is constructed with sparse_coo_tensor to sanity-check the minimum and maximum indices. However, this sanity check is unnecessary here, because coalescing the tensor cannot change those min/max values.

On my benchmark with 1 million non-zeros, the runtime of coalesce improved about 100x, from 0.52 s to 0.005 s.
Pull Request resolved: pytorch/pytorch#21214

Reviewed By: bddppq

Differential Revision: D15584338

Pulled By: akyrola

fbshipit-source-id: a08378baa018dbd0b45d7aba661fc9aefd3791e0
@facebook-github-bot
Contributor

@akyrola merged this pull request in 678dc44.

akyrola pushed a commit to akyrola/pytorch that referenced this pull request Jun 5, 2019
Summary:
When optimizing sparse tensor coalesce, I noticed that this kernel was taking the bulk of the time (see PR pytorch#21214). It is used (at least) in the sparse tensor constructor to validate that the min/max indices of the index tensor are in range.
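For reference, the check the constructor performs boils down to one min and one max reduction per index dimension, compared against the tensor's size. A hedged pure-Python sketch of that check (function name and shapes are illustrative, not the actual ATen code):

```python
def check_indices(indices, sizes):
    """Validate COO indices: every index along dim d must lie in [0, sizes[d]).

    indices: one list of indices per dimension, e.g. [[0, 2, 1], [1, 3, 1]]
    sizes:   the dense shape, e.g. (3, 4)
    This is the sanity check that the min/max reduction kernel performs on
    the GPU: one min and one max reduction per dimension.
    """
    for d, (dim_indices, size) in enumerate(zip(indices, sizes)):
        if min(dim_indices) < 0:
            raise ValueError(f"found negative index in dimension {d}")
        if max(dim_indices) >= size:
            raise ValueError(f"index out of bounds in dimension {d}")

check_indices([[0, 2, 1], [1, 3, 1]], (3, 4))  # in range: no error raised
```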

This PR rewrites the kernel using a CUB reduction, achieving about a 16x speedup. With my coalesce benchmark, nvprof showed this before:
```
#  GPU activities:   45.47%  2.42669s       101  24.027ms  23.862ms  28.968ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
#                    45.41%  2.42386s       101  23.999ms  23.857ms  28.944ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```

... and after this change:

```
 GPU activities:   19.50%  154.92ms       101  1.5338ms  1.5285ms  1.5987ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                   19.45%  154.52ms       101  1.5299ms  1.5247ms  1.5933ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```
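The speedup comes from reducing hierarchically: each thread block reduces its chunk of the innermost dimension (the level CUB handles on the GPU), and the per-block partial results are then reduced to a single value. A pure-Python sketch of that two-level idea (block size and names are illustrative):

```python
def block_reduce_min(data, block_size=4):
    """Two-level min reduction.

    Level 1: reduce each fixed-size block to a partial minimum
    (on a GPU, CUB does this cooperatively within a thread block).
    Level 2: reduce the partial minima to the final result.
    """
    partials = [min(data[i:i + block_size])
                for i in range(0, len(data), block_size)]
    return min(partials)

vals = [7, 3, 9, 1, 5, 8, 2, 6, 4]
assert block_reduce_min(vals) == min(vals) == 1
```

The same structure applies to the max reduction; only the combining operator changes.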
Pull Request resolved: pytorch#21295

Differential Revision: D15606873

fbshipit-source-id: e5bc86933efa44c36c3b2942114a04c20abd7700
