[ROCm] Improve performance of reductions on 1D and 2D tensors. #137737
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137737
Note: Links to docs will display an error until the docs builds have completed.
❗ 1 Active SEV: There is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (3 unrelated failures) as of commit a510da8 with merge base c272526.
FLAKY: The following job failed, but this was likely due to flakiness present on trunk.
BROKEN TRUNK: The following jobs failed, but they were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 0e00ac8 to e23df4e
Force-pushed e23df4e to 993f17e
@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This patch also addresses the performance issue reported in #132964. For the examples in that issue, the performance impact is as follows:
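The cases this PR targets (1D sums and 2D sums along either dimension) can be timed with a small micro-benchmark. This is a minimal sketch assuming PyTorch is installed; it falls back to CPU when no GPU is present, and the shapes are illustrative rather than the exact examples from issue #132964.

```python
# Hypothetical micro-benchmark for the reduction shapes this PR targets.
# Not part of the PR itself; shapes and iteration counts are illustrative.
import time
import torch

def bench_sum(shape, dim=None, iters=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(*shape, device=device)
    # GPU kernels launch asynchronously, so synchronize around timing.
    sync = torch.cuda.synchronize if device == "cuda" else (lambda: None)
    for _ in range(3):  # warm-up
        _ = x.sum() if dim is None else x.sum(dim=dim)
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x.sum() if dim is None else x.sum(dim=dim)
    sync()
    return (time.perf_counter() - t0) / iters  # seconds per call

# 1D reduction, and 2D reductions along dim 0 and dim 1:
t_1d = bench_sum((1 << 24,))
t_dim0 = bench_sum((4096, 4096), dim=0)
t_dim1 = bench_sum((4096, 4096), dim=1)
```

Comparing these timings before and after the patch (on a ROCm build) is how a 15 ms → 1.5 ms style improvement would show up.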
Force-pushed bb2bd99 to f62670b
Release version of upstream PR [137737](pytorch#137737). This adds support for GPUs with a smaller number of CUs. The smaller-CU optimization will be upstreamed later, once it has baked in the release branch.
xw285cornell left a comment
Nice! Verified it improves some op from 15ms to 1.5ms.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased f62670b to ad01179
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed; the first few are: periodic / linux-focal-cuda11.8-py3.10-gcc9-debug / test (default, 3, 5, lf.linux.4xlarge.nvidia.gpu), Meta Internal-Only Changes Check. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased ad01179 to a510da8
@jianyuh Please ignore, it's passing now after the rebase.
@pytorchbot merge (initiating merge automatically since the Phabricator diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ch#137737)

This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types:
- 1D tensors
- 2D tensors when reducing along dimension 0
- 2D tensors when reducing along dimension 1

Runtime reduction is between 0 and 75% depending on tensor shape. The patch uses the maximum number of threads per CU and the number of CUs itself to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions).

Pull Request resolved: pytorch#137737
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell
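The core idea, sizing the grid from occupancy (max threads per CU times the CU count) rather than from the tensor size alone, can be sketched as follows. This is an illustrative reconstruction, not PyTorch's actual implementation; the function name and the default values (304 CUs as on MI300X, 2048 threads per CU) are assumptions.

```python
# Hedged sketch of occupancy-based grid sizing for a reduction kernel.
# Launching more blocks than the GPU can keep resident adds scheduling
# overhead without adding parallelism; extra elements are instead covered
# by a grid-stride loop inside the kernel.

def grid_size_for_reduction(numel, block_threads=256,
                            num_cus=304, max_threads_per_cu=2048):
    # How many blocks the whole GPU can keep resident at once.
    max_resident_blocks = (max_threads_per_cu // block_threads) * num_cus
    # How many blocks a one-thread-per-element launch would need.
    blocks_by_size = (numel + block_threads - 1) // block_threads
    # Launch the smaller of the two, but at least one block.
    return max(1, min(blocks_by_size, max_resident_blocks))
```

For a small tensor this returns the element-driven block count; for a very large tensor it caps the launch at the occupancy limit, which is the behavior a small-CU GPU benefits from.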
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd