
Conversation

@akyrola

@akyrola akyrola commented May 31, 2019

Studied why sparse tensor coalesce was slow: issue #10757.

Using nvprof and a simple benchmark, I determined that the bulk of the time was spent in `kernelTransformReduceInnermostDimIndex`, which is called when a sparse tensor is constructed with sparse_coo_tensor to sanity-check the minimum and maximum indices. However, this sanity check is unnecessary here, because coalescing the tensor cannot change those min/max values.
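For context, coalescing merges duplicate entries of a COO tensor by summing their values. A minimal pure-Python sketch (illustrative only, not the actual PyTorch implementation) shows why the per-dimension min/max of the indices cannot change:

```python
from collections import defaultdict

def coalesce(indices, values):
    """Merge duplicate COO entries by summing their values.

    indices: list of (row, col) tuples; values: matching list of floats.
    Merging duplicates never introduces a new index tuple, so the
    per-dimension min/max of the indices is the same before and after.
    """
    acc = defaultdict(float)
    for idx, v in zip(indices, values):
        acc[idx] += v
    merged = sorted(acc.items())  # sort the indices, as coalesce() does
    return [i for i, _ in merged], [v for _, v in merged]

idx = [(0, 1), (2, 3), (0, 1)]   # one duplicate entry
val = [1.0, 2.0, 3.0]
ci, cv = coalesce(idx, val)
assert ci == [(0, 1), (2, 3)] and cv == [4.0, 2.0]
# min/max are unchanged, so re-validating them after coalesce is redundant
assert min(ci) == min(idx) and max(ci) == max(idx)
```

Because coalescing only merges duplicates, any bounds check that passed when the original tensor was constructed still holds for the coalesced result.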

On my benchmark with 1 million non-zeros, the runtime of coalesce improved about 100x, from 0.52 s to 0.005 s.

Test: test/test_sparse.py.

I also fixed CMakeLists.txt, which did not include the .cuh files in the native/sparse/cuda directory.

```
==63940== Profiling application: python coalesce_bench.py
==63940== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   45.47%  2.42669s       101  24.027ms  23.862ms  28.968ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                   45.41%  2.42386s       101  23.999ms  23.857ms  28.944ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                    2.84%  151.49ms       100  1.5149ms  1.5082ms  1.6950ms  void at::native::apply::coalesceValuesKernel<float, float>(long*, at::native::apply::coalesceValuesKernel<float, float>, float*, float, long, long, long)
                    2.65%  141.65ms       900  157.39us  153.06us  162.31us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__merge_sort::MergeAgent<thrust::device_ptr<long>, thrust::device_ptr<long>, long, ThrustLTOp<long, bool=0>, thrust::detail::integral_constant<bool, bool=1>>, bool, thrust::device_ptr<long>, thrust::device_ptr<long>, long, long*, long*, ThrustLTOp<long, bool=0>, long*, long>(thrust::device_ptr<long>, thrust::device_ptr<long>, long, long, bool=0, ThrustLTOp<long, bool=0>, bool, bool=1, thrust::detail::integral_constant<bool, bool=1>)
```

We should also look into why tensor min/max is slow.

Collaborator

@ssnl ssnl left a comment


Thanks a lot! This is the correct fix.

@ezyang If the cmake change is done at the correct place, we should merge this.

Contributor

@bddppq bddppq left a comment


LGTM!

@akyrola force-pushed the akyrola/coalesce branch from 6a9f57a to 0c37a2e on May 31, 2019 19:43
@pytorchbot added the labels module: build (Build system issues), module: cuda (Related to torch.cuda, and CUDA support in general), module: internals (Related to internal abstractions in c10 and ATen), module: operators, and module: sparse (Related to torch.sparse) on May 31, 2019
@akyrola force-pushed the akyrola/coalesce branch from 0c37a2e to c79fb30 on May 31, 2019 20:50
Contributor

@facebook-github-bot facebook-github-bot left a comment


@akyrola has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 1, 2019
Summary:
Studied why sparse tensor coalesce was slow:  issue #10757.

Using nvprof and a simple benchmark, I determined that the bulk of the time was spent in ``kernelTransformReduceInnermostDimIndex``, which is called when a sparse tensor is constructed with sparse_coo_tensor to sanity-check the minimum and maximum indices. However, this sanity check is unnecessary here, because coalescing the tensor cannot change those min/max values.

On my benchmark with 1 million non-zeros, the runtime of coalesce improved about 100x, from 0.52 s to 0.005 s.
Pull Request resolved: pytorch/pytorch#21214

Reviewed By: bddppq

Differential Revision: D15584338

Pulled By: akyrola

fbshipit-source-id: a08378baa018dbd0b45d7aba661fc9aefd3791e0
@facebook-github-bot
Contributor

@akyrola merged this pull request in 678dc44.

akyrola pushed a commit to akyrola/pytorch that referenced this pull request Jun 5, 2019
Summary:
When optimizing sparse tensor coalesce, I noticed that this kernel was taking the bulk of the time (see PR pytorch#21214). It is used (at least) in the sparse tensor constructor to validate that the min/max indices of the index tensor are in range.
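For reference, the check the constructor performs boils down to one min and one max reduction per index dimension, compared against the tensor's size. A hedged pure-Python sketch of that check (function name and shapes are illustrative, not the actual ATen code):

```python
def check_indices(indices, sizes):
    """Validate COO indices: every index along dim d must lie in [0, sizes[d]).

    indices: one list of indices per dimension, e.g. [[0, 2, 1], [1, 3, 1]]
    sizes:   the dense shape, e.g. (3, 4)
    This is the sanity check that the min/max reduction kernel performs on
    the GPU: one min and one max reduction per dimension.
    """
    for d, (dim_indices, size) in enumerate(zip(indices, sizes)):
        if min(dim_indices) < 0:
            raise ValueError(f"found negative index in dimension {d}")
        if max(dim_indices) >= size:
            raise ValueError(f"index out of bounds in dimension {d}")

check_indices([[0, 2, 1], [1, 3, 1]], (3, 4))  # in range: no error raised
```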

This PR rewrites the kernel using a CUB reduction, achieving about a 16x speedup. With my coalesce benchmark, nvprof showed this before:
```
#  GPU activities:   45.47%  2.42669s       101  24.027ms  23.862ms  28.968ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
#                    45.41%  2.42386s       101  23.999ms  23.857ms  28.944ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```

... and after this change:

```
 GPU activities:   19.50%  154.92ms       101  1.5338ms  1.5285ms  1.5987ms  void kernelTransformReduceInnermostDimIndex<long, long, MinValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
                   19.45%  154.52ms       101  1.5299ms  1.5247ms  1.5933ms  void kernelTransformReduceInnermostDimIndex<long, long, MaxValuePair<long, long>>(long*, long*, long*, unsigned int, unsigned int, thrust::pair<long, long>, long)
```
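The speedup comes from reducing hierarchically: each thread block reduces its chunk of the innermost dimension (the level CUB handles on the GPU), and the per-block partial results are then reduced to a single value. A pure-Python sketch of that two-level idea (block size and names are illustrative):

```python
def block_reduce_min(data, block_size=4):
    """Two-level min reduction.

    Level 1: reduce each fixed-size block to a partial minimum
    (on a GPU, CUB does this cooperatively within a thread block).
    Level 2: reduce the partial minima to the final result.
    """
    partials = [min(data[i:i + block_size])
                for i in range(0, len(data), block_size)]
    return min(partials)

vals = [7, 3, 9, 1, 5, 8, 2, 6, 4]
assert block_reduce_min(vals) == min(vals) == 1
```

The same structure applies to the max reduction; only the combining operator changes.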
Pull Request resolved: pytorch#21295

Differential Revision: D15606873

fbshipit-source-id: e5bc86933efa44c36c3b2942114a04c20abd7700
