[ROCm] Improve performance of reductions on 1D and 2D tensors. #137737
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137737
Note: Links to docs will display an error until the docs builds have completed.
❗ 1 Active SEV: There is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (3 unrelated failures) as of commit a510da8 with merge base c272526.
FLAKY: The following job failed, but this was likely due to flakiness present on trunk.
BROKEN TRUNK: The following jobs failed, but they were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 0e00ac8 to e23df4e
Force-pushed e23df4e to 993f17e
@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This patch also addresses the performance issue reported in #132964. For the examples in that issue, the performance impact is as follows:
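The cases this PR targets (1D sums and 2D sums along either dimension) can be timed with a small micro-benchmark. This is a minimal sketch assuming PyTorch is installed; it falls back to CPU when no GPU is present, and the shapes are illustrative rather than the exact examples from issue #132964.

```python
# Hypothetical micro-benchmark for the reduction shapes this PR targets.
# Not part of the PR itself; shapes and iteration counts are illustrative.
import time
import torch

def bench_sum(shape, dim=None, iters=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(*shape, device=device)
    # GPU kernels launch asynchronously, so synchronize around timing.
    sync = torch.cuda.synchronize if device == "cuda" else (lambda: None)
    for _ in range(3):  # warm-up
        _ = x.sum() if dim is None else x.sum(dim=dim)
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x.sum() if dim is None else x.sum(dim=dim)
    sync()
    return (time.perf_counter() - t0) / iters  # seconds per call

# 1D reduction, and 2D reductions along dim 0 and dim 1:
t_1d = bench_sum((1 << 24,))
t_dim0 = bench_sum((4096, 4096), dim=0)
t_dim1 = bench_sum((4096, 4096), dim=1)
```

Comparing these timings before and after the patch (on a ROCm build) is how a 15 ms → 1.5 ms style improvement would show up.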
Force-pushed bb2bd99 to f62670b
Release version of upstream PR [137737](pytorch#137737). This adds support for GPUs with a smaller number of CUs. The smaller-CU optimization will be upstreamed later, once it has baked in the release branch.
xw285cornell left a comment
Nice! Verified it improves some op from 15ms to 1.5ms.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased f62670b to ad01179
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed; the first few are: periodic / linux-focal-cuda11.8-py3.10-gcc9-debug / test (default, 3, 5, lf.linux.4xlarge.nvidia.gpu), Meta Internal-Only Changes Check. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased ad01179 to a510da8
@jianyuh Please ignore, it's passing now after the rebase.
@pytorchbot merge (initiating merge automatically since the Phabricator diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ch#137737)

This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types:
- 1D tensors
- 2D tensors when reducing along dimension 0
- 2D tensors when reducing along dimension 1

Runtime reduction is between 0 and 75% depending on tensor shape. The patch uses the maximum number of threads per CU and the number of CUs itself to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions).

Pull Request resolved: pytorch#137737
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell
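The core idea, sizing the grid from occupancy (max threads per CU times the CU count) rather than from the tensor size alone, can be sketched as follows. This is an illustrative reconstruction, not PyTorch's actual implementation; the function name and the default values (304 CUs as on MI300X, 2048 threads per CU) are assumptions.

```python
# Hedged sketch of occupancy-based grid sizing for a reduction kernel.
# Launching more blocks than the GPU can keep resident adds scheduling
# overhead without adding parallelism; extra elements are instead covered
# by a grid-stride loop inside the kernel.

def grid_size_for_reduction(numel, block_threads=256,
                            num_cus=304, max_threads_per_cu=2048):
    # How many blocks the whole GPU can keep resident at once.
    max_resident_blocks = (max_threads_per_cu // block_threads) * num_cus
    # How many blocks a one-thread-per-element launch would need.
    blocks_by_size = (numel + block_threads - 1) // block_threads
    # Launch the smaller of the two, but at least one block.
    return max(1, min(blocks_by_size, max_resident_blocks))
```

For a small tensor this returns the element-driven block count; for a very large tensor it caps the launch at the occupancy limit, which is the behavior a small-CU GPU benefits from.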
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd