updating upsampling bilinear2d kernel #21879
Conversation
jjsjann123 commented Jun 17, 2019
- faster atomicAdd trick for fp16 backward kernel (see the sketch below)
- better launch configs for backward kernel
- removed unnecessary buffer initialization for forward kernel
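For context, a minimal sketch of the `__half2`-packing idea behind the faster fp16 atomicAdd in the first item above. The function name `fast_fp16_atomic_add`, its signature, and the exact alignment/bounds checks are illustrative assumptions rather than the code in this PR; the alignment handling here also anticipates the point later discussed in #44642.

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Illustrative sketch (hypothetical name, not the PR's fastAtomicAdd): pack
// the fp16 update into one 4-byte-aligned __half2 atomicAdd whenever a
// neighboring element is available, since the __half2 overload of atomicAdd
// (sm_60+) is much faster than per-__half atomics.
__device__ __forceinline__ void fast_fp16_atomic_add(
    __half* tensor,   // base pointer of the fp16 buffer
    size_t index,     // element to accumulate into
    size_t numel,     // number of elements in the buffer
    float value) {
  // The element is the low half of an aligned __half2 word when the base
  // pointer alignment and the index parity agree.
  const bool base_aligned =
      reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) == 0;
  const bool low_half = base_aligned ? (index % 2 == 0) : (index % 2 == 1);

  if (low_half && index + 1 < numel) {
    // Add {value, 0} to the aligned word starting at this element.
    atomicAdd(reinterpret_cast<__half2*>(tensor + index),
              __halves2half2(__float2half(value), __float2half(0.f)));
  } else if (!low_half && index > 0) {
    // Add {0, value} to the aligned word ending at this element.
    atomicAdd(reinterpret_cast<__half2*>(tensor + index - 1),
              __halves2half2(__float2half(0.f), __float2half(value)));
  } else {
    // Edge element with no full word available: plain __half atomicAdd
    // (sm_70+; older archs would need a CAS loop here instead).
    atomicAdd(tensor + index, __float2half(value));
  }
}
```

The backward kernel scatters many fp16 contributions into grad_input, so replacing per-`__half` atomics with aligned `__half2` atomics is where the speedup comes from.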
Perf numbers were measured similarly to #21694.

PyLinter error on a Python script that doesn't exist... seems to be unrelated.

Same deal, deferring to @ngimel here.
ngimel left a comment
Make sure grad_input is always contiguous (I think it is); in that case the fast_atomic argument won't be needed.
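A tiny illustration of this point using the libtorch C++ API (this is not the dispatch code in the PR, and the sizes are arbitrary): the gradient buffer a backward op allocates for itself is a fresh allocation and therefore contiguous, so a kernel writing into it could always take the fast-atomic path without a flag.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // A freshly allocated fp16 gradient buffer, as a backward op would create it.
  auto grad_input = torch::zeros({2, 3, 8, 8}, torch::kHalf);
  std::cout << std::boolalpha << grad_input.is_contiguous() << "\n";  // true
  return 0;
}
```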
facebook-github-bot left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
1. faster atomicAdd trick for fp16 backward kernel
2. better launch configs for backward kernel
3. removed unnecessary buffer initialization for forward kernel

Pull Request resolved: pytorch/pytorch#21879
Differential Revision: D15898680
Pulled By: ezyang
fbshipit-source-id: 1fc81e6c078f1538d82e4f36921b630499eb504f
… 32 bit aligned (#44642)

Summary: For #44206 and #42218, I'd like to update trilinear interpolate backward and grid_sample backward to use `fastAtomicAdd`. As a prelude, I spotted a UB risk in `fastAtomicAdd`. I think the existing code incurs a misaligned `__half2` atomicAdd when `index` is odd and `tensor` is not 32-bit aligned (`index % 2 == 1` and `reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) == 1`). In this case we think we're `!low_bit` and go down the `!low_bit` code path, but in fact we are `low_bit`. It appears the discussion on the original fastAtomicAdd PR (#21879) did not consider that case explicitly. I wanted to push my tentative fix for discussion ASAP. cc jjsjann123 and mkolod as original authors of `fastAtomicAdd`. (I'm also curious why we need to `reinterpret_cast<std::uintptr_t>(tensor...` for the address modding, but that's minor.)

Pull Request resolved: #44642
Reviewed By: mruberry
Differential Revision: D23699820
Pulled By: ngimel
fbshipit-source-id: 0db57150715ebb45e6a1fb36897e46f00d61defd
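To make the misalignment scenario concrete, here is a sketch of the distinction being described (the helper name `is_low_half` is hypothetical; this is not the actual diff in #44642): whether an element starts a 4-byte-aligned `__half2` word depends on both the index parity and the base pointer's alignment, so a check that looks only at the index can pick the wrong path.

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical helper, not PyTorch's actual code: decide whether element
// `index` of an fp16 buffer is the low half of a 4-byte-aligned __half2 word.
__device__ __forceinline__ bool is_low_half(const __half* tensor, size_t index) {
  const bool base_aligned =
      reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) == 0;
  // Aligned base: even indices start a word. Base offset by 2 bytes: odd ones do.
  return base_aligned ? (index % 2 == 0) : (index % 2 == 1);
}
// A check based on index parity alone misclassifies odd indices over an
// offset base, and the resulting __half2 atomicAdd then lands on a
// misaligned address -- the UB risk described above.
```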
Summary: Fixes #44206

This PR basically follows the diff in #21879 for upsampling bilinear. For the script provided in #44206, on my 2070 Super GPU, the total timing I got (time in seconds):

| | non-amp | amp |
|---|---|---|
| before PR | 2.88 | 9.6 |
| after PR | 1.5 | 1.6 |

Kernel time after PR:

| | time | kernel |
| --- | --- | --- |
| non-amp | 0.37 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<float, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, float*, float const*)` |
| amp | 0.61 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<c10::Half, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, c10::Half*, c10::Half const*)` |

Pull Request resolved: #48675
Reviewed By: bdhirsh
Differential Revision: D25284853
Pulled By: ngimel
fbshipit-source-id: 30f0d92e73050edd36013ce528d2e131effa3542
