Improve performance of advanced indexing backward #20557
Conversation
…, move functions from Indexing.h
CI is failing today (also in another PR); it looks like the fbgemm submodule is not recursively updated. Can I do something on my side? I've just merged master, so I don't think a rebase would help?
@pytorchbot rebase this please. Should be fixed on master.
I'll let @colesbury take a first whack at this.
I've pushed what I hope is a fix for the ROCm failures (and removed tabs), but I think the same failures should be happening in embedding backward, because in some places WARP_SIZE is assumed to be 32 and in others a device-dependent #define'd WARP_SIZE is used. Is embedding backward not compiled/tested on ROCm?
cc @iotamudelta @bddppq re Natalia's question
@colesbury Let me know if you want me to attempt a review (the PR seems pretty involved, heh heh)
The fix for ROCm, changing from assuming a WARP_SIZE of 32 to 64, makes sense. Embedding is certainly compiled in ROCm (it's not in the skipped functions); I cannot comment on the tests off the top of my head. Thanks!
@ezyang maybe we should aim to have one warp size defined somewhere, based on whatever architecture we compile for, and have that reused everywhere, as opposed to the wild ifdef'ing and hardcoding that has happened so far? What header would be a good place for something like this to go?
Yes, this sounds reasonable. Maybe something like c10/macros/Macros.h or c10/cuda/CUDAMacros.h |
For obvious reasons I prefer c10/macros/Macros.h - I can get started on that. |
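A minimal sketch of the idea being discussed, assuming placeholder names (the exact macro name and header layout that landed in c10/macros/Macros.h may differ, and the HIP platform define shown is the one HIP used around this time): define the warp size once, based on the architecture being compiled for, and use that constant everywhere instead of hardcoding 32.
```
// Sketch only: one centrally defined warp size, picked at compile time.
#if defined(__HIP_PLATFORM_HCC__)
#define WARP_SIZE 64   // AMD GPUs (ROCm) use 64-wide wavefronts
#else
#define WARP_SIZE 32   // NVIDIA GPUs use 32-wide warps
#endif
```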
Yes, it's wrapIndexOnce; it runs a few (on the order of 10) launch-latency-bound kernels, adding 20-30 us. I've just done it as a kwarg, but I'll redo it as a separate API. Asserts in the device code in forward are not a problem.
deapproving in anticipation of unsafe indexing changes
Added a non-user-facing unsafe index_put option and addressed @ezyang's comments. I still plead copy paste.
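To illustrate the trade-off being discussed (a sketch only, not the actual ATen implementation; the kernel name and signature are made up): the "safe" path launches extra small kernels up front to wrap and validate negative indices, paying launch latency, while an "unsafe" path can assume indices are already in range and only enforce that with a device-side assert inside the main kernel.
```
#include <cassert>
#include <cstdint>

// Illustrative gather-style kernel: instead of a separate host-launched
// index-wrapping pass (the launch-latency cost mentioned above), assume
// 0 <= idx < num_rows and enforce it with a device-side assert.
__global__ void gather_rows_unsafe(const float* src, const int64_t* indices,
                                   float* dst, int64_t num_rows,
                                   int64_t row_size, int64_t num_indices) {
  int64_t i = blockIdx.x;              // one block per gathered row
  if (i >= num_indices) return;
  int64_t idx = indices[i];
  assert(idx >= 0 && idx < num_rows);  // no extra kernel launches needed
  for (int64_t j = threadIdx.x; j < row_size; j += blockDim.x) {
    dst[i * row_size + j] = src[idx * row_size + j];
  }
}
```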
@pytorchbot rebase this please |
Waiting on CI to land |
@pytorchbot rebase this please |
@pytorchbot retest this please |
facebook-github-bot left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
This PR improves performance of advanced indexing backward, partially solving #15245 (performance is still worse than gather, but not by such outrageous margins). Before, using the benchmarking harness from #15245, cuda 10/V100:
```
Indexing is faster by at most -270.61607820767887 us on N: 16 D: 256 K: 1
Indexing is slower by at most 11127.466280784833 us on N: 16 D: 4096 K: 4096
```
after:
```
Indexing is faster by at most 23.524456737696028 us on N: 512 D: 4096 K: 4096
Indexing is slower by at most 186.24056029472553 us on N: 16 D: 1024 K: 4096
```
The strategy is to reuse the embedding backward kernel, adapting it to handle unindexed dimensions in the beginning by launching additional threadblocks, and also allowing it to handle slices bigger than `65K*128`, which is hardly ever a problem for embedding. Still, integer indexing is baked into the kernel and is important for performance, so tensors bigger than 2G elements are not supported for now. The main savings come from not having to expand the index to all unindexed dimensions, and from sorting only the unexpanded index rather than sorting the expanded index together with the incoming gradient values. There are ways to make the sorting overhead smaller (thanks @mcarilli for the suggestions), but I'll get to it when it becomes a real problem, or rather when cuda graphs force us to get rid of the thrust::sort calls.

I've also added tests for indexing backward; before, tests for index_put_ and indexing backward were non-existent.

This PR also fixes #20457 by casting indices to the `self` backend.

Pull Request resolved: pytorch/pytorch#20557
Differential Revision: D15582434
Pulled By: ezyang
fbshipit-source-id: 91e8f2769580588ec7d18823d99a26f1c0da8e2a
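The core of the strategy described in the summary, in a deliberately simplified sketch (this is not the actual PyTorch kernel; names are made up, and the handling of unindexed leading dimensions, very large slices, and launch-config limits is omitted): sort only the small, unexpanded index together with a permutation of positions, then let the first block in each run of equal index values accumulate all of that run's gradient slices into a single destination row.
```
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <cstdint>

// One block per position in the sorted index. Only the block that sees the
// first occurrence of a given index value walks the whole run of duplicates,
// so each row of grad_self is written by exactly one block and no cross-block
// atomics are needed. grad_self is assumed to be zero-initialized.
__global__ void accumulate_sorted(const int64_t* sorted_idx, const int64_t* perm,
                                  const float* grad_out, float* grad_self,
                                  int64_t num_indices, int64_t slice_size) {
  int64_t i = blockIdx.x;
  if (i >= num_indices) return;
  if (i > 0 && sorted_idx[i] == sorted_idx[i - 1]) return;  // not the run head
  const int64_t dst = sorted_idx[i];
  for (int64_t run = i; run < num_indices && sorted_idx[run] == dst; ++run) {
    const int64_t src_row = perm[run];  // original position of this index
    for (int64_t j = threadIdx.x; j < slice_size; j += blockDim.x) {
      grad_self[dst * slice_size + j] += grad_out[src_row * slice_size + j];
    }
  }
}

// Host side: sort the unexpanded index (cheap, since it is never expanded to
// all unindexed dimensions) while carrying the original positions, then launch
// the accumulation kernel. idx, grad_out, grad_self are device pointers.
void indexing_backward_sketch(const int64_t* idx, const float* grad_out,
                              float* grad_self, int64_t num_indices,
                              int64_t slice_size) {
  thrust::device_ptr<const int64_t> idx_begin(idx);
  thrust::device_vector<int64_t> sorted_idx(idx_begin, idx_begin + num_indices);
  thrust::device_vector<int64_t> perm(num_indices);
  thrust::sequence(perm.begin(), perm.end());
  thrust::sort_by_key(sorted_idx.begin(), sorted_idx.end(), perm.begin());
  const unsigned int blocks = static_cast<unsigned int>(num_indices);
  accumulate_sorted<<<blocks, 128>>>(
      thrust::raw_pointer_cast(sorted_idx.data()),
      thrust::raw_pointer_cast(perm.data()),
      grad_out, grad_self, num_indices, slice_size);
}
```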