
Conversation

@jjsjann123
Collaborator

  1. faster atomicAdd trick for fp16 backward kernel
  2. better launch configs for backward kernel
  3. removed unnecessary buffer initialization for forward kernel

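To make item 1 of the list above concrete, here is a minimal sketch of the fp16 fast-atomicAdd trick, written for this summary rather than copied from the PR: instead of issuing a scalar `__half` atomicAdd, the value is padded into an aligned `__half2` whose other lane is zero, so the hardware performs a single 32-bit atomic. The function name `fast_half_atomic_add` and the compute-capability assumption are mine; the actual ATen helper (`fastAtomicAdd`) covers more dtypes and dispatch details.

```cuda
// Sketch only: assumes compute capability 7.0+, so both __half and __half2
// atomicAdd are available in hardware.
#include <cuda_fp16.h>
#include <cstdint>

__device__ __forceinline__ void fast_half_atomic_add(__half* tensor,
                                                     int index,
                                                     int numel,
                                                     __half value) {
  __half* target = tensor + index;
  // Is `target` the low half of a 4-byte-aligned __half2 word?
  bool low_half =
      (reinterpret_cast<std::uintptr_t>(target) % sizeof(__half2)) == 0;
  if (low_half && index < numel - 1) {
    // Add {value, 0} to the 4-byte word starting at `target`.
    atomicAdd(reinterpret_cast<__half2*>(target),
              __halves2half2(value, __int2half_rz(0)));
  } else if (!low_half && index > 0) {
    // `target` is the high half: add {0, value} to the word one element back.
    atomicAdd(reinterpret_cast<__half2*>(target - 1),
              __halves2half2(__int2half_rz(0), value));
  } else {
    // First/last element with no safe partner lane: plain scalar atomicAdd.
    atomicAdd(target, value);
  }
}
```

The boundary branches fall back to a scalar atomicAdd so the zero-padded lane never touches memory outside the tensor.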
@pytorchbot added the module: cuda and module: operators labels Jun 17, 2019
@jjsjann123
Collaborator Author

[Image: benchmark table of forward/backward timings and speedups before and after this PR]

Perf numbers measured similarly to those in #21694.

  1. The faster atomicAdd helps fp16 performance a lot (compare the backward speedup for fp16 with fp32).
  2. The updated launch configs perform better with larger N*C and smaller spatial dimensions.
  3. Removing the unnecessary initialization reduced forward time.
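To illustrate the second observation, the snippet below is a hypothetical host-side heuristic, not the launch-config code from this PR: spreading the grid over both the spatial extent and the N*C dimension launches enough blocks to saturate the device when N*C is large but each spatial plane is small. The function name, the 256-thread block, and the 65535 cap are illustrative choices only.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical sketch of a 2-D launch config for a per-plane backward kernel.
inline void pick_backward_launch_config(int nc, int spatial,
                                        dim3* grid, dim3* block) {
  const int kThreads = 256;  // threads per block (illustrative)
  *block = dim3(kThreads);
  // Blocks along the spatial dimension; at least one even for tiny planes.
  int spatial_blocks = std::max(1, (spatial + kThreads - 1) / kThreads);
  // Put N*C on grid.y so that large-N*C / small-spatial shapes still launch
  // enough blocks to fill the device (grid.y is capped at 65535).
  int nc_blocks = std::min(nc, 65535);
  *grid = dim3(spatial_blocks, nc_blocks);
}
```

A kernel launched as `kernel<<<grid, block>>>(...)` would then use `blockIdx.y` to select the (n, c) plane and `blockIdx.x * blockDim.x + threadIdx.x` for the spatial position.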

@jjsjann123
Collaborator Author

cc'ing @ngimel @ezyang

@jjsjann123
Collaborator Author

The PyLinter error is on a Python script that doesn't exist; it seems unrelated to this change.

@ezyang ezyang requested review from ezyang and ngimel June 18, 2019 14:58
@ezyang
Contributor

ezyang commented Jun 18, 2019

Same deal, deferring to @ngimel here

Collaborator

@ngimel left a comment

Make sure grad_input is always contiguous (I think it is); in that case the fast_atomic argument won't be needed.

Contributor

@facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 19, 2019
Summary:
1. faster atomicAdd trick for fp16 backward kernel
2. better launch configs for backward kernel
3. removed unnecessary buffer initialization for forward kernel
Pull Request resolved: pytorch/pytorch#21879

Differential Revision: D15898680

Pulled By: ezyang

fbshipit-source-id: 1fc81e6c078f1538d82e4f36921b630499eb504f
@facebook-github-bot
Contributor

@ezyang merged this pull request in 056a033.

facebook-github-bot pushed a commit that referenced this pull request Sep 15, 2020
… 32 bit aligned (#44642)

Summary:
For #44206 and #42218, I'd like to update trilinear interpolate backward and grid_sample backward to use `fastAtomicAdd`.

As a prelude, I spotted a UB risk in `fastAtomicAdd`. I think the existing code incurs a misaligned `__half2` atomicAdd when `index` is odd and `tensor` is not 32-bit aligned (`index % 2 == 1` and `(reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2)) == 2`). In this case we think we're `!low_bit` and go down the `!low_bit` code path, but in fact we are `low_bit`. It appears the discussion on the original fastAtomicAdd PR (#21879) did not consider that case explicitly.

I wanted to push my tentative fix for discussion ASAP, cc'ing jjsjann123 and mkolod as the original authors of `fastAtomicAdd`. (I'm also curious why we need to `reinterpret_cast<std::uintptr_t>(tensor...` for the address modding, but that's minor.)

Pull Request resolved: #44642

Reviewed By: mruberry

Differential Revision: D23699820

Pulled By: ngimel

fbshipit-source-id: 0db57150715ebb45e6a1fb36897e46f00d61defd
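The snippet below is a small sketch of the alignment mismatch described in that commit message (illustrative names, not the exact ATen code): deriving "low half of a `__half2` word" from index parity plus the base pointer's alignment can disagree with the element's real alignment when the base is only 2-byte aligned, whereas checking the element's own address cannot.

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Pre-fix style check (sketch): even index AND 4-byte-aligned base pointer.
// If the base address % 4 == 2 and index is odd, this returns false even
// though tensor + index actually starts a 4-byte-aligned __half2 word, so a
// caller that trusts it would form a misaligned __half2* one element earlier.
__device__ __forceinline__ bool is_low_half_by_parity(const __half* tensor,
                                                      int index) {
  return (index % 2 == 0) &&
         (reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2)) == 0;
}

// Post-fix style check (sketch): look at the element's own address instead.
__device__ __forceinline__ bool is_low_half_by_address(const __half* tensor,
                                                       int index) {
  return (reinterpret_cast<std::uintptr_t>(tensor + index) %
          sizeof(__half2)) == 0;
}
```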
xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
… 32 bit aligned (#44642)

facebook-github-bot pushed a commit that referenced this pull request Dec 3, 2020
Summary:
Fixes #44206

This PR basically follows the diff in #21879 for bilinear upsampling.

For the script provided in #44206, on my 2070 Super GPU, the total timings I got (in seconds):

| | non-amp | amp |
|---|---|---|
| before PR | 2.88 | 9.6 |
| after PR | 1.5 | 1.6 |

kernel time after PR
| | time | kernel |
| --- | --- | --- |
| non-amp | 0.37 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<float, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, float*, float const*) ` |
| amp | 0.61 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<c10::Half, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, c10::Half*, c10::Half const*)` |

Pull Request resolved: #48675

Reviewed By: bdhirsh

Differential Revision: D25284853

Pulled By: ngimel

fbshipit-source-id: 30f0d92e73050edd36013ce528d2e131effa3542
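For readers wondering why these backward kernels need atomics at all, here is a simplified 1-D sketch of the scatter pattern, my own example rather than the trilinear kernel named above: each output-gradient element adds weighted contributions to its neighbouring input positions, and several outputs can map onto the same input, so the accumulation must be atomic. In the ATen kernels discussed here, the plain `atomicAdd` calls below are where the `fastAtomicAdd` helper is swapped in.

```cuda
#include <cuda_runtime.h>

// Simplified 1-D linear-interpolation backward: scatter grad_out into grad_in.
// grad_in must be zero-initialized before the launch.
__global__ void linear_interp1d_backward_sketch(
    const float* __restrict__ grad_out,  // [out_w]
    float* __restrict__ grad_in,         // [in_w]
    int out_w, int in_w, float scale) {
  int o = blockIdx.x * blockDim.x + threadIdx.x;
  if (o >= out_w) return;

  // Source coordinate and the two bracketing input indices.
  float src = o * scale;
  int i0 = min(static_cast<int>(src), in_w - 1);
  int i1 = min(i0 + 1, in_w - 1);
  float w1 = src - i0;   // weight of the right neighbour
  float w0 = 1.0f - w1;  // weight of the left neighbour

  float g = grad_out[o];
  // Different output indices can hit the same i0/i1 -> atomic accumulation.
  atomicAdd(&grad_in[i0], w0 * g);
  atomicAdd(&grad_in[i1], w1 * g);
}
```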
shaibagon pushed a commit to shaibagon/pytorch that referenced this pull request Dec 3, 2020