Improve performance of advanced indexing backward #20557
Conversation
…, move functions from Indexing.h
CI is failing today (also in another PR); it looks like the fbgemm submodule is not recursively updated. Can I do something on my side? I've just merged master, so I don't think a rebase would help?
@pytorchbot rebase this please. Should be fixed on master.
I'll let @colesbury take a first whack at this.
I've pushed what I hope is a fix for the ROCm failures (and removed tabs), but I think the same failures should be happening in embedding backward, because in some places WARP_SIZE is assumed to be 32 and in others a device-dependent #define'd WARP_SIZE is used. Is embedding backward not compiled/tested on ROCm?
cc @iotamudelta @bddppq re Natalia's question
@colesbury Let me know if you want me to attempt a review (the PR seems pretty involved, heh heh)
The fix for ROCm, changing from assuming a WARP_SIZE of 32 to 64, makes sense. Embedding is certainly compiled in ROCm (it's not in the skipped functions); I cannot comment on the tests off the top of my head. Thanks!
@ezyang maybe we should aim to have one warp size defined somewhere, based on whatever architecture we compile for, and have that reused everywhere, as opposed to the wild ifdef'ing and hardcoding that has happened so far? What header would be a good place for something like this to go?
Yes, this sounds reasonable. Maybe something like c10/macros/Macros.h or c10/cuda/CUDAMacros.h |
For obvious reasons I prefer c10/macros/Macros.h - I can get started on that. |
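A minimal sketch of the idea being discussed, assuming placeholder names (the exact macro name and header layout that landed in c10/macros/Macros.h may differ, and the HIP platform define shown is the one HIP used around this time): define the warp size once, based on the architecture being compiled for, and use that constant everywhere instead of hardcoding 32.
```
// Sketch only: one centrally defined warp size, picked at compile time.
#if defined(__HIP_PLATFORM_HCC__)
#define WARP_SIZE 64   // AMD GPUs (ROCm) use 64-wide wavefronts
#else
#define WARP_SIZE 32   // NVIDIA GPUs use 32-wide warps
#endif
```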
Yes, it's wrapIndexOnce; it runs a few (on the order of 10) launch-latency-bound kernels, adding 20-30 us. I've just done it as a kwarg, but I'll redo it as a separate API. Asserts in the device code in forward are not a problem.
deapproving in anticipation of unsafe indexing changes
Added a non-user-facing unsafe index_put option and addressed @ezyang's comments. I still plead copy paste.
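To illustrate the trade-off being discussed (a sketch only, not the actual ATen implementation; the kernel name and signature are made up): the "safe" path launches extra small kernels up front to wrap and validate negative indices, paying launch latency, while an "unsafe" path can assume indices are already in range and only enforce that with a device-side assert inside the main kernel.
```
#include <cassert>
#include <cstdint>

// Illustrative gather-style kernel: instead of a separate host-launched
// index-wrapping pass (the launch-latency cost mentioned above), assume
// 0 <= idx < num_rows and enforce it with a device-side assert.
__global__ void gather_rows_unsafe(const float* src, const int64_t* indices,
                                   float* dst, int64_t num_rows,
                                   int64_t row_size, int64_t num_indices) {
  int64_t i = blockIdx.x;              // one block per gathered row
  if (i >= num_indices) return;
  int64_t idx = indices[i];
  assert(idx >= 0 && idx < num_rows);  // no extra kernel launches needed
  for (int64_t j = threadIdx.x; j < row_size; j += blockDim.x) {
    dst[i * row_size + j] = src[idx * row_size + j];
  }
}
```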
@pytorchbot rebase this please |
Waiting on CI to land |
@pytorchbot rebase this please |
@pytorchbot retest this please |
facebook-github-bot left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
This PR improves performance of advanced indexing backward, partially solving #15245 (performance is still worse than gather, but not by such outrageous margins). Before, using the benchmarking harness from #15245, cuda 10/V100:
```
Indexing is faster by at most -270.61607820767887 us on N: 16 D: 256 K: 1
Indexing is slower by at most 11127.466280784833 us on N: 16 D: 4096 K: 4096
```
after:
```
Indexing is faster by at most 23.524456737696028 us on N: 512 D: 4096 K: 4096
Indexing is slower by at most 186.24056029472553 us on N: 16 D: 1024 K: 4096
```
The strategy is to reuse the embedding backward kernel, adapting it to handle unindexed dimensions in the beginning by launching additional threadblocks, and also allowing it to handle slices bigger than `65K*128`, which is hardly ever a problem for embedding. Still, integer indexing is baked into the kernel and is important for performance, so tensors bigger than 2G elements are not supported for now. The main savings come from not having to expand the index to all unindexed dimensions, and from sorting only the unexpanded index rather than sorting the expanded index together with the incoming gradient values. There are ways to make the sorting overhead smaller (thanks @mcarilli for the suggestions), but I'll get to it when it becomes a real problem, or rather when cuda graphs force us to get rid of the thrust::sort calls.

I've also added tests for indexing backward; before, tests for index_put_ and indexing backward were non-existent.

This PR also fixes #20457 by casting indices to the `self` backend.

Pull Request resolved: pytorch/pytorch#20557
Differential Revision: D15582434
Pulled By: ezyang
fbshipit-source-id: 91e8f2769580588ec7d18823d99a26f1c0da8e2a
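The core of the strategy described in the summary, in a deliberately simplified sketch (this is not the actual PyTorch kernel; names are made up, and the handling of unindexed leading dimensions, very large slices, and launch-config limits is omitted): sort only the small, unexpanded index together with a permutation of positions, then let the first block in each run of equal index values accumulate all of that run's gradient slices into a single destination row.
```
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <cstdint>

// One block per position in the sorted index. Only the block that sees the
// first occurrence of a given index value walks the whole run of duplicates,
// so each row of grad_self is written by exactly one block and no cross-block
// atomics are needed. grad_self is assumed to be zero-initialized.
__global__ void accumulate_sorted(const int64_t* sorted_idx, const int64_t* perm,
                                  const float* grad_out, float* grad_self,
                                  int64_t num_indices, int64_t slice_size) {
  int64_t i = blockIdx.x;
  if (i >= num_indices) return;
  if (i > 0 && sorted_idx[i] == sorted_idx[i - 1]) return;  // not the run head
  const int64_t dst = sorted_idx[i];
  for (int64_t run = i; run < num_indices && sorted_idx[run] == dst; ++run) {
    const int64_t src_row = perm[run];  // original position of this index
    for (int64_t j = threadIdx.x; j < slice_size; j += blockDim.x) {
      grad_self[dst * slice_size + j] += grad_out[src_row * slice_size + j];
    }
  }
}

// Host side: sort the unexpanded index (cheap, since it is never expanded to
// all unindexed dimensions) while carrying the original positions, then launch
// the accumulation kernel. idx, grad_out, grad_self are device pointers.
void indexing_backward_sketch(const int64_t* idx, const float* grad_out,
                              float* grad_self, int64_t num_indices,
                              int64_t slice_size) {
  thrust::device_ptr<const int64_t> idx_begin(idx);
  thrust::device_vector<int64_t> sorted_idx(idx_begin, idx_begin + num_indices);
  thrust::device_vector<int64_t> perm(num_indices);
  thrust::sequence(perm.begin(), perm.end());
  thrust::sort_by_key(sorted_idx.begin(), sorted_idx.end(), perm.begin());
  const unsigned int blocks = static_cast<unsigned int>(num_indices);
  accumulate_sorted<<<blocks, 128>>>(
      thrust::raw_pointer_cast(sorted_idx.data()),
      thrust::raw_pointer_cast(perm.data()),
      grad_out, grad_self, num_indices, slice_size);
}
```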