
Conversation

@jataylo jataylo (Collaborator) commented Nov 11, 2024

It was reported that the backward layer norm on AMD was slightly less accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel, `cuLoadWriteStridedInputs`, which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (#87635) we truncated `mean` and `rstd` from the accumulator type `T_ACC` to `T`, which causes numerical issues in the warp buffers created in this kernel. This PR uses the correct accumulator type for `mean` and `rstd`.
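
As a hedged, standalone illustration of the problem (this emulates the types, not the kernel itself: float32 stands in for `T_ACC`, float16 for `T`, and the shape and seed are arbitrary), truncating the saved statistics perturbs every partial-gradient term staged into the buffers:

```python
# Illustrative sketch only: emulate truncating the saved statistics from
# the accumulator type (float32, standing in for T_ACC) to the storage
# type (float16, standing in for T) before they enter the gradient terms.
import torch

torch.manual_seed(0)
x = torch.randn(4096)   # one normalized row, float32
dy = torch.randn(4096)  # incoming gradient

mean = x.mean()
rstd = (x.var(unbiased=False) + 1e-5).rsqrt()

# Partial gradient for gamma, as it would be accumulated into the buffers.
ref = (dy * (x - mean) * rstd).sum()  # statistics kept in the accumulator type
bad = (dy * (x - mean.half().float()) * rstd.half().float()).sum()  # truncated

print(f"relative error from truncated stats: {(ref - bad).abs() / ref.abs():.2e}")
```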

Note: only AMD calls into this call stack for backward layer norm, so this was not an issue for NVIDIA.
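
For reference, a minimal end-to-end check of the kind behind this report might look like the following. This is a sketch under assumptions: shapes, seed, and printed quantities are illustrative rather than the project's actual test, and it needs a ROCm build to exercise the affected path. The gamma/beta gradients are the ones accumulated through this helper kernel, so those are the ones to compare:

```python
# Hedged sketch: compare fp16 layer-norm backward on the GPU against a
# float64 reference built from the same values.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(64, 1024, device="cuda", dtype=torch.float16, requires_grad=True)
w = torch.randn(1024, device="cuda", dtype=torch.float16, requires_grad=True)
b = torch.randn(1024, device="cuda", dtype=torch.float16, requires_grad=True)
F.layer_norm(x, (1024,), w, b).sum().backward()

# Float64 reference on the same inputs.
x64 = x.detach().double().requires_grad_()
w64 = w.detach().double().requires_grad_()
b64 = b.detach().double().requires_grad_()
F.layer_norm(x64, (1024,), w64, b64).sum().backward()

# The gamma/beta gradients are the ones staged through the helper kernel.
print("max weight-grad error:", (w.grad.double() - w64.grad).abs().max().item())
print("max bias-grad error:  ", (b.grad.double() - b64.grad).abs().max().item())
```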

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd

@pytorch-bot pytorch-bot bot commented Nov 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140259

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 496b18a with merge base 565a794:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/rocm, module: rocm, and release notes: cuda labels Nov 11, 2024
@jataylo jataylo requested review from jeffdaily and malfet November 11, 2024 14:25
@jataylo jataylo marked this pull request as ready for review November 11, 2024 14:26
@jataylo jataylo added the rocm priority label Nov 11, 2024
@facebook-github-bot (Contributor) commented

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 11, 2024
@jataylo jataylo (Collaborator, Author) commented Nov 11, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)

It was raised that the backwards layer norm on AMD was slightly off the accuracy of the equivalent NVIDIA implementation.

On AMD we call into a helper kernel `cuLoadWriteStridedInputs` which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (pytorch#87635) we truncated `mean` and `rstd` from T_ACC type to T which causes numerical issues in the warp buffers created in this kernel. This PR will use the correct accumulator type for mean and rstd.

Note: Only AMD call into this call stack for backwards layer norm, so this was not an issue for NV.

Pull Request resolved: pytorch#140259
Approved by: https://github.com/jianyuh

(cherry picked from commit 001f736)
jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)
jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)
jataylo added a commit to ROCm/pytorch that referenced this pull request Dec 4, 2024
…ch#140259)
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…ch#140259)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Dec 6, 2024
… kernel (pytorch#140259) (#1766)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Dec 6, 2024
… kernel (pytorch#140259) (#1767)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 6, 2024
… kernel (pytorch#140259) (#1767)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1766)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1766)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1766)
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Dec 13, 2024
… kernel (pytorch#140259) (#1767)

Labels

ciflow/rocm (Trigger "default" config CI on ROCm)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: rocm (AMD GPU support for Pytorch)
open source
release notes: cuda (release notes category)
rocm priority (high priority ROCm PRs from performance or other aspects)
