TensorIterator cuda launch configs update #16224
Conversation
Summary: Update launch configs for TensorIterator gpu_reduce_kernel. Enable a flexible block dimension to improve efficiency for reduction cases with a small fast dimension.
Previously, TensorIterator launched blocks with a fixed 32x16 thread configuration. For cases like:
import torch
torch.randn(2**20, 4, device='cuda').sum(0)
the fixed launch config does not handle coalesced memory access efficiently. The updated launch config enables a flexible block dimension. Combined with the improved reduction scheme (flexible vertical / horizontal reduction instead of the limited warp / block reduction in the old code), it ensures an optimal memory access pattern even when reducing over a dimension with small stride.
Possible future improvements:
1. Precise dynamic shared memory allocation.
2. Using warp shuffle for vertical (block_y) reduction.
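For reference, a minimal timing sketch for the case described above (shapes, iteration count, and the comparison shape are illustrative, not taken from the PR):

```python
import torch

def time_sum(shape, dim, iters=100):
    # Time a CUDA reduction along `dim`; CUDA events measure GPU time only.
    x = torch.randn(*shape, device='cuda')
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        x.sum(dim)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

# Small fast (innermost) dimension: the case this PR targets.
print(time_sum((2**20, 4), dim=0))
# A larger fast dimension, for comparison.
print(time_sum((2**20, 128), dim=0))
```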
For visibility @ngimel @umanwizard @colesbury
facebook-github-bot left a comment:
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
We are getting this error on some internal builds:
umanwizard left a comment:
see above comment
My bad.
facebook-github-bot left a comment:
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I'm somewhat worried about the slowdowns. Why are we getting rid of warp reductions? Aren't those supposed to be fast?
I'm not getting rid of warp reduction; it is kept where necessary for the old launch config with a 32x16 block dimension: https://github.com/pytorch/pytorch/pull/16224/files#diff-662693ef7b7f32fa32d7179b6614fc16R379. warp_reduce is renamed to block_x_reduce because we now have a flexible block dimension: for cases where blockDim.x > 32, the x reduction needs a hybrid of shared-memory reduction and warp reduction, since not all threads along x are in the same warp.
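To illustrate the hybrid idea (a conceptual sketch only, not the TensorIterator kernel code): with blockDim.x > 32, a row of threads first folds partial sums through shared memory until only one warp's worth of values remains, and only then a warp-level reduction finishes the job. A rough Python stand-in for that two-stage pattern, assuming blockDim.x = 128 and warp size 32:

```python
WARP_SIZE = 32

def block_x_reduce(partials):
    """Reduce one partial sum per thread along the block's x dimension."""
    vals = list(partials)
    width = len(vals)
    # Stage 1: shared-memory style halving until one warp's width remains.
    while width > WARP_SIZE:
        half = width // 2
        for i in range(half):
            vals[i] += vals[i + half]  # thread i reads thread (i + half)'s slot
        width = half
    # Stage 2: warp-level reduction (shuffle-style halving inside one warp).
    while width > 1:
        half = width // 2
        for i in range(half):
            vals[i] += vals[i + half]
        width = half
    return vals[0]

print(block_x_reduce(range(128)))  # == sum(range(128)) == 8128
```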
Test failure doesn't seem to be relevant. Merging ToT (top of tree) to see if it goes away.
facebook-github-bot left a comment:
@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Update launch configs for TensorIterator gpu_reduce_kernel. Enable a flexible block dimension to improve efficiency for reduction cases with a small fast dimension.
Previously, TensorIterator launched blocks with a fixed 32x16 thread configuration. For cases like:
import torch
torch.randn(2**20, 4, device='cuda').sum(0)
the fixed launch config does not handle coalesced memory access efficiently. The updated launch config enables a flexible block dimension. Combined with the improved reduction scheme (flexible vertical / horizontal reduction instead of the limited warp / block reduction in the old code), it ensures an optimal memory access pattern even when reducing over a dimension with small stride.
Possible future improvements:
1. Precise dynamic shared memory allocation.
2. Using warp shuffle for vertical (block_y) reduction.
Pull Request resolved: pytorch/pytorch#16224
Differential Revision: D13806753
Pulled By: soumith
fbshipit-source-id: 37e45c7767b5748cf9ecf894fad306e040e2f79f
just fyi @jjsjann123 this is being reverted, we are seeing "illegal memory exception" and |
I think I figured out the issue: I overlooked the old heuristics for inter-block vs. inter-warp reduction. Surprised this wasn't caught by CI tests. Working on a fix; I will also add tests that cover it.
update:
1. global_reduce now checks should_block_y_reduce first. This avoids enabling global_reduce without block_y_reduce, which led to accessing shared memory during the global reduce without it having been allocated.
2. Updated the block_y_reduce heuristics; improves performance on tiny tensors.
3. Added a test case covering the old cases where an illegal memory access might occur (see the sketch after this list).
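A hedged sketch of the kind of regression test item 3 describes (the shapes are illustrative guesses, not the ones added in the PR): compare the CUDA reduction against the CPU result over a range of shapes and reduced dimensions, which surfaces both wrong results and crashes such as illegal memory accesses.

```python
import torch

# Illustrative shapes only; the actual test shapes added in the PR may differ.
shapes = [(2**20, 4), (4, 2**20), (7, 11, 5), (1, 1), (1023, 3)]

for shape in shapes:
    # Double precision so CPU and GPU sums agree to tight tolerance.
    x = torch.randn(shape, dtype=torch.double)
    for dim in range(x.dim()):
        cpu = x.sum(dim)
        gpu = x.cuda().sum(dim).cpu()
        assert torch.allclose(cpu, gpu), (shape, dim)
```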
Summary:
update:
1. global_reduce now checks should_block_y_reduce first. This avoids enabling global_reduce without block_y_reduce, which led to accessing shared memory during the global reduce without it having been allocated.
2. Updated the block_y_reduce heuristics; improves performance on tiny tensors.
3. Added a test case covering the old cases where an illegal memory access might occur.
TensorIterator cuda launch configs update (#16224)
Update launch configs for TensorIterator gpu_reduce_kernel. Enable a flexible block dimension to improve efficiency for reduction cases with a small fast dimension.
Previously, TensorIterator launched blocks with a fixed 32x16 thread configuration. For cases like:
import torch
torch.randn(2**20, 4, device='cuda').sum(0)
the fixed launch config does not handle coalesced memory access efficiently. The updated launch config enables a flexible block dimension. Combined with the improved reduction scheme (flexible vertical / horizontal reduction instead of the limited warp / block reduction in the old code), it ensures an optimal memory access pattern even when reducing over a dimension with small stride.
Possible future improvements:
1. Precise dynamic shared memory allocation.
2. Using warp shuffle for vertical (block_y) reduction.
Pull Request resolved: #16224
Pull Request resolved: #17040
Differential Revision: D14078295
Pulled By: umanwizard
fbshipit-source-id: ecc55054a5a4035e731f0196d633412225c3b06c