
Conversation

@ngimel (Collaborator) commented May 15, 2019

This PR improves the performance of advanced indexing backward, partially solving #15245 (performance is still worse than gather, but not by such outrageous margins). Before, using the benchmarking harness from #15245, on CUDA 10 / V100:

Indexing is faster by at most -270.61607820767887 us on N: 16 D: 256 K: 1
Indexing is slower by at most 11127.466280784833 us on N: 16 D: 4096 K: 4096

after:

Indexing is faster by at most 23.524456737696028 us on N: 512 D: 4096 K: 4096
Indexing is slower by at most 186.24056029472553 us on N: 16 D: 1024 K: 4096
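
As a rough illustration of the comparison being quoted (this is not the actual #15245 harness, and the shapes below are just one of its (N, D, K) points), the two backwards can be timed along these lines on a CUDA device:

```python
import time
import torch

# Time the backward of advanced indexing, x[idx], against the equivalent
# gather for a single (N, D, K) point. Rough wall-clock timing only.
def bench_us(fn, iters=50):
    fn()                                   # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1e6

N, D, K = 16, 4096, 4096
x = torch.randn(N, D, device="cuda", requires_grad=True)
idx = torch.randint(N, (K,), device="cuda")
grad = torch.randn(K, D, device="cuda")

t_index = bench_us(lambda: x[idx].backward(grad))
t_gather = bench_us(lambda: x.gather(0, idx.unsqueeze(1).expand(K, D)).backward(grad))
print(f"indexing backward: {t_index:.1f} us, gather backward: {t_gather:.1f} us")
```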

The strategy is to reuse the embedding backward kernel, adapting it to handle unindexed leading dimensions by launching additional threadblocks, and also allowing it to handle slices bigger than 65K*128 elements, which is hardly ever a problem for embedding. Still, integer (32-bit) indexing is baked into the kernel and is important for performance, so for now tensors with more than 2G elements are not supported.
The main savings come from not having to expand the index to all unindexed dimensions, and from sorting only the unexpanded index rather than sorting the expanded index together with the incoming gradient values.
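
To make that last point concrete, here is a minimal Python emulation of the reduction the kernel has to perform, using existing ops rather than the CUDA code itself; the shapes follow the N/D/K naming of the benchmark above and are otherwise arbitrary:

```python
import torch

# Reference semantics: for y = x[:, idx], grad_x[:, j] is the sum of
# grad_y[:, k] over all k with idx[k] == j, i.e. a scatter-add along the
# indexed dimension.
N, D, K = 4, 8, 16
x = torch.randn(N, D, requires_grad=True)
idx = torch.randint(D, (K,))
grad_y = torch.randn(N, K)

x[:, idx].backward(grad_y)                 # advanced indexing backward

# Sort-based strategy: sort only the K index entries (not an index expanded
# to N*K entries), then accumulate gradient columns in sorted order so that
# duplicate destinations end up adjacent.
sorted_idx, perm = idx.sort()
ref = torch.zeros(N, D).index_add_(1, sorted_idx, grad_y[:, perm])

assert torch.allclose(x.grad, ref)
```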
There are ways to make the sorting overhead smaller (thanks @mcarilli for the suggestions), but I'll get to it when it becomes a real problem, or rather, when CUDA graphs force us to get rid of the thrust::sort calls.
I've also added tests for indexing backward; before this, tests for index_put_ and indexing backward were non-existent.
This PR also fixes #20457 by casting indices to the self tensor's backend.
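
For reference, a small sketch of the #20457 situation that the cast is meant to handle (indices created on the CPU while the indexed tensor lives on the GPU); it assumes a CUDA device is available and uses made-up values:

```python
import torch

# Indices on the CPU, tensor on the GPU: previously index_put_ rejected the
# non-matching backend (#20457); with the cast the CPU indices are moved to
# self's backend. Illustrative only; requires a CUDA device.
if torch.cuda.is_available():
    src = torch.zeros(10, device="cuda")
    cpu_idx = torch.tensor([1, 3, 3])                         # CPU LongTensor
    src.index_put_((cpu_idx,), torch.ones(3, device="cuda"), accumulate=True)
    assert src[3].item() == 2.0                               # duplicate indices accumulate
```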

@pytorchbot added the module: cuda (Related to torch.cuda, and CUDA support in general) and module: operators labels May 15, 2019
@ngimel (Collaborator, Author) commented May 15, 2019

CI today is failing with

May 15 22:32:24 CMake Error at third_party/fbgemm/CMakeLists.txt:115 (add_subdirectory):
May 15 22:32:24   The source directory
May 15 22:32:24 
May 15 22:32:24     /var/lib/jenkins/workspace/third_party/fbgemm/third_party/asmjit
May 15 22:32:24 
May 15 22:32:24   does not contain a CMakeLists.txt file.
May 15 22:32:24 

(also in another PR); it looks like the fbgemm submodule is not being updated recursively. Is there something I can do on my side? I've just merged master, so I don't think a rebase would help?

@ngimel requested review from colesbury and ezyang May 15, 2019 23:13
@ezyang (Contributor) commented May 16, 2019

@pytorchbot rebase this please

should be fixed on master

@ezyang (Contributor) commented May 16, 2019

I'll let @colesbury take a first whack at this.

@li-roy added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label May 16, 2019
@ngimel (Collaborator, Author) commented May 20, 2019

I've pushed what I hope is a fix for the ROCm failures (and removed tabs), but I think the same failures should be happening in embedding backward, because in some places WARP_SIZE is assumed to be 32 while in others a device-dependent #define'd WARP_SIZE is used. Is embedding backward not compiled/tested on ROCm?

@ezyang (Contributor) commented May 21, 2019

cc @iotamudelta @bddppq re Natalia's question

@ezyang added the module: rocm (AMD GPU support for PyTorch) label May 21, 2019
@ezyang (Contributor) commented May 21, 2019

@colesbury Let me know if you want me to attempt a review (the PR seems pretty involved, heh heh)

@iotamudelta (Contributor)

The fix for ROCm to change from assuming WARP_SIZE 32 to 64 makes sense. Embedding is certainly compiled on ROCm (it's not among the skipped functions); I cannot comment on the tests off the top of my head. Thanks!

@iotamudelta (Contributor)

@ezyang maybe we should aim to have one warp size defined in a single place, based on whatever architecture we compile for, and reuse it everywhere, as opposed to the wild ifdef'ing and hardcoding that has happened so far? What header would be a good place for something like this to go?

@ezyang (Contributor) commented May 21, 2019

we should aim to have one warp size defined in a single place, based on whatever architecture we compile for, and reuse it everywhere, as opposed to the wild ifdef'ing and hardcoding that has happened so far? What header would be a good place for something like this to go?

Yes, this sounds reasonable. Maybe something like c10/macros/Macros.h or c10/cuda/CUDAMacros.h.

@iotamudelta (Contributor)

For obvious reasons I prefer c10/macros/Macros.h - I can get started on that.

@ngimel (Collaborator, Author) commented May 28, 2019

Yes, it's wrapIndexOnce, which runs a few (on the order of 10) launch-latency-bound kernels and adds 20-30 us. I've just done a kwarg for now, but I'll redo it as a separate API. Asserts in the device code in the forward are not a problem.
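
For context, a tiny Python sketch of the wrapping being discussed; the function name mirrors the C++ helper, but the body only illustrates the semantics (map negative indices into range and bounds-check), not the kernels involved:

```python
import torch

def wrap_index_once(idx: torch.Tensor, dim_size: int) -> torch.Tensor:
    # Illustrative stand-in: negative indices become dim_size + idx, and
    # out-of-range indices raise. On CUDA this check/wrap costs a handful of
    # small, launch-latency-bound kernels, which is what an unsafe path can
    # skip for indices already known to be in [0, dim_size).
    if ((idx < -dim_size) | (idx >= dim_size)).any():
        raise IndexError("index out of range")
    return torch.where(idx < 0, idx + dim_size, idx)

print(wrap_index_once(torch.tensor([0, -1, 3, -4]), 8))   # tensor([0, 7, 3, 4])
```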

@ezyang dismissed their stale review May 29, 2019 11:32

de-approving in anticipation of the unsafe indexing changes

@pytorchbot added the module: internals (Related to internal abstractions in c10 and ATen) label May 29, 2019
@ngimel (Collaborator, Author) commented May 29, 2019

Added a non-user-facing unsafe index_put option and addressed @ezyang's comments. I still plead copy-paste.
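
If it helps reviewers, a sketch of how such an internal entry point might be exercised; the torch._index_put_impl_ name and argument order used below are an assumption (the name the internal op appears to carry in later releases), not a documented API:

```python
import torch

# Hypothetical use of the non-user-facing entry point with accumulate=True
# and unsafe=True; unsafe means the caller promises the indices are already
# non-negative and in range, so the wrapping/bounds kernels can be skipped.
t = torch.zeros(8)
idx = torch.tensor([1, 3, 3])
vals = torch.ones(3)
torch._index_put_impl_(t, (idx,), vals, True, True)   # accumulate=True, unsafe=True
assert t[3].item() == 2.0
```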

@ezyang (Contributor) commented May 29, 2019

@pytorchbot rebase this please

@ezyang (Contributor) commented May 29, 2019

Waiting on CI to land

@ngimel (Collaborator, Author) commented May 30, 2019

@pytorchbot rebase this please

@ngimel (Collaborator, Author) commented May 31, 2019

@pytorchbot retest this please

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in ad971a3.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 3, 2019
Summary:
(Same as the PR description above.)
Pull Request resolved: pytorch/pytorch#20557

Differential Revision: D15582434

Pulled By: ezyang

fbshipit-source-id: 91e8f2769580588ec7d18823d99a26f1c0da8e2a
ezyang added a commit that referenced this pull request Jul 19, 2019
This reverts commit ad971a3.

Fixes #22843.  The revert also adds a test for this case.

Hopefully we can find a real fix for this and don't have to revert
the commit, but I'm posting this PR in case we cannot find a fix
in time for release.
ezyang added a commit that referenced this pull request Jul 19, 2019
This reverts commit ad971a3. (Same message as the commit above.)

ghstack-source-id: 443d434
Pull Request resolved: #23102

Labels

Merged; module: cuda (Related to torch.cuda, and CUDA support in general); module: internals (Related to internal abstractions in c10 and ATen); module: rocm (AMD GPU support for PyTorch); open source; triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Development

Successfully merging this pull request may close these issues.

index_put_ no longer accepts indices with non-matching backend

9 participants