
Conversation

@jataylo jataylo (Collaborator) commented Nov 11, 2024

It was reported that the backward layer norm on AMD was slightly less accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel, `cuLoadWriteStridedInputs`, which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (#87635) we truncated `mean` and `rstd` from the accumulator type `T_ACC` to `T`, which causes numerical issues in the warp buffers created in this kernel. This PR uses the correct accumulator type for `mean` and `rstd`.
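
As a hedged, standalone illustration of the problem (this emulates the types, not the kernel itself: float32 stands in for `T_ACC`, float16 for `T`, and the shape and seed are arbitrary), truncating the saved statistics perturbs every partial-gradient term staged into the buffers:

```python
# Illustrative sketch only: emulate truncating the saved statistics from
# the accumulator type (float32, standing in for T_ACC) to the storage
# type (float16, standing in for T) before they enter the gradient terms.
import torch

torch.manual_seed(0)
x = torch.randn(4096)   # one normalized row, float32
dy = torch.randn(4096)  # incoming gradient

mean = x.mean()
rstd = (x.var(unbiased=False) + 1e-5).rsqrt()

# Partial gradient for gamma, as it would be accumulated into the buffers.
ref = (dy * (x - mean) * rstd).sum()  # statistics kept in the accumulator type
bad = (dy * (x - mean.half().float()) * rstd.half().float()).sum()  # truncated

print(f"relative error from truncated stats: {(ref - bad).abs() / ref.abs():.2e}")
```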

Note: only AMD calls into this call stack for backward layer norm, so this was not an issue for NVIDIA.
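
For reference, a minimal end-to-end check of the kind behind this report might look like the following. This is a sketch under assumptions: shapes, seed, and printed quantities are illustrative rather than the project's actual test, and it needs a ROCm build to exercise the affected path. The gamma/beta gradients are the ones accumulated through this helper kernel, so those are the ones to compare:

```python
# Hedged sketch: compare fp16 layer-norm backward on the GPU against a
# float64 reference built from the same values.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(64, 1024, device="cuda", dtype=torch.float16, requires_grad=True)
w = torch.randn(1024, device="cuda", dtype=torch.float16, requires_grad=True)
b = torch.randn(1024, device="cuda", dtype=torch.float16, requires_grad=True)
F.layer_norm(x, (1024,), w, b).sum().backward()

# Float64 reference on the same inputs.
x64 = x.detach().double().requires_grad_()
w64 = w.detach().double().requires_grad_()
b64 = b.detach().double().requires_grad_()
F.layer_norm(x64, (1024,), w64, b64).sum().backward()

# The gamma/beta gradients are the ones staged through the helper kernel.
print("max weight-grad error:", (w.grad.double() - w64.grad).abs().max().item())
print("max bias-grad error:  ", (b.grad.double() - b64.grad).abs().max().item())
```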

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd

@pytorch-bot pytorch-bot bot commented Nov 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140259

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 496b18a with merge base 565a794:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/rocm, module: rocm, and release notes: cuda labels Nov 11, 2024
@jataylo jataylo requested review from jeffdaily and malfet November 11, 2024 14:25
@jataylo jataylo marked this pull request as ready for review November 11, 2024 14:26
@jataylo jataylo added the rocm priority label Nov 11, 2024
@facebook-github-bot (Contributor) commented

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 11, 2024
@jataylo jataylo (Collaborator, Author) commented Nov 11, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)

It was raised that the backwards layer norm on AMD was slightly off the accuracy of the equivalent NVIDIA implementation.

On AMD we call into a helper kernel `cuLoadWriteStridedInputs` which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (pytorch#87635) we truncated `mean` and `rstd` from T_ACC type to T which causes numerical issues in the warp buffers created in this kernel. This PR will use the correct accumulator type for mean and rstd.

Note: Only AMD call into this call stack for backwards layer norm, so this was not an issue for NV.

Pull Request resolved: pytorch#140259
Approved by: https://github.com/jianyuh

(cherry picked from commit 001f736)
jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)
jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)
jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…ch#140259)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Dec 6, 2024
… kernel (pytorch#140259) (#1766)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Dec 6, 2024
… kernel (pytorch#140259) (#1767)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 6, 2024
… kernel (pytorch#140259) (#1767)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1766)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1766)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1766)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1767)

Labels

ciflow/rocm (Trigger "default" config CI on ROCm)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: rocm (AMD GPU support for Pytorch)
open source
release notes: cuda (release notes category)
rocm priority (high priority ROCm PRs from performance or other aspects)
