[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config #135624

pytorchbot · 2024-09-10T21:54:38Z

This change optimizes torch.sum() performance on the ROCm platform by increasing max_values_per_thread in setReduceConfig().
With a larger value, each thread reduces more elements, so the kernel uses fewer thread blocks, which improves performance for large tensors.
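
To illustrate why a larger per-thread value count means fewer thread blocks, here is a minimal back-of-the-envelope sketch. It is not the logic in setReduceConfig(), which weighs more factors; the helper name `blocks_needed`, the 512 threads per block, and the sample values are assumptions chosen only for illustration.

```python
def blocks_needed(num_elements, threads_per_block, values_per_thread):
    """Blocks required if every thread reduces up to `values_per_thread` inputs.

    Hypothetical helper: a simplified model, not PyTorch's actual heuristic.
    """
    per_block = threads_per_block * values_per_thread
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-num_elements // per_block)


n = 2**30  # same element count as the benchmark tensor
for vpt in (16, 64, 256):  # illustrative values, not the real ROCm/CUDA settings
    print(f"values_per_thread={vpt:>3}: {blocks_needed(n, 512, vpt)} blocks")
```

In this simplified model, quadrupling the per-thread value count cuts the block count by a factor of four.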

Test:
Tested on MI300x and H100. For the test case below, MI300x throughput improved from ~1690 GByte/s to 3205 GByte/s, which is slightly better than H100 (3136 GByte/s).

import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')
ms = do_bench(lambda: x.sum(dim=-1))
bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)
time_s = ms / 1000
bw_per_second = bandwidth_gbyte / time_s
print(bw_per_second)
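
As a sanity check on the arithmetic in the benchmark above, the following sketch runs the same formula without a GPU and converts the reported bandwidths back into approximate kernel times. The helper names are hypothetical, and the 4-byte item size assumes the default float32 dtype of torch.randn; the bandwidth figures are the ones quoted in the Test section.

```python
def effective_bandwidth_gb_per_s(numel, itemsize, ms):
    """Bytes read per second, in GB/s (10**9 bytes), given a runtime in ms."""
    gigabytes = numel * itemsize / 10**9
    return gigabytes / (ms / 1000)


def time_ms_for_bandwidth(numel, itemsize, gb_per_s):
    """Inverse of the above: runtime in ms implied by a bandwidth in GB/s."""
    gigabytes = numel * itemsize / 10**9
    return gigabytes / gb_per_s * 1000


numel, itemsize = 2**30, 4  # assumes float32, the default dtype of torch.randn
reported = [("MI300x before", 1690), ("MI300x after", 3205), ("H100", 3136)]
for label, gb_per_s in reported:
    ms = time_ms_for_bandwidth(numel, itemsize, gb_per_s)
    print(f"{label}: {gb_per_s} GB/s is about {ms:.2f} ms per call")
```

Note that the figure counts only the bytes read from the input tensor; the scalar output is negligible at this size.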

Co-author: @carlobertolli @hongxiayang

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@carlobertolli

…d in reduce config (#135397)

Fixes #132964

This change is to optimize torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for ROCm platform. By increasing this parameter, it uses fewer threadblocks and improved the performance.

Test:
Tested on MI300x and H100, and now the MI300x perf improved to 3205GByte/s from ~1690GByte/s for the test case and is slightly better than H100 (3136GByte/s). Also tested with other different sizes of tensors and also see perf improvement.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')
ms = do_bench(lambda: x.sum(dim=-1))
bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)
time_s = ms / 1000
bw_per_second = bandwidth_gbyte / time_s
print(bw_per_second)
```

Co-author: @carlobertolli

Pull Request resolved: #135397
Approved by: https://github.com/eqy, https://github.com/malfet

(cherry picked from commit eb38ee2)

pytorch-bot · 2024-09-10T21:54:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135624

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 70e2277 with merge base b7eb725:

NEW FAILURE - The following job has failed:

pull / linux-focal-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge) (gh)
ModuleNotFoundError: No module named 'torch.version'

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

rocm / linux-focal-rocm6.1-py3.8 / test (default, 3, 6, linux.rocm.gpu.2) (gh) (trunk failure)
functorch/test_control_flow.py::TestControlFlow::test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda
rocm / linux-focal-rocm6.1-py3.8 / test (default, 4, 6, linux.rocm.gpu.2) (gh) (trunk failure)
inductor/test_loop_ordering.py::LoopOrderingTest::test_fp8_cast_and_t
rocm / linux-focal-rocm6.1-py3.8 / test (default, 5, 6, linux.rocm.gpu.2) (gh) (trunk failure)
inductor/test_flex_decoding.py::TestFlexDecoding::test_builtin_score_mods_bfloat16_score_mod0_head_dims0

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pruthvistony · 2024-09-12T05:45:13Z

@atalman,
I am not sure if I can trigger a merge on a release branch PR.
Can you please help merge this PR?

github-actions · 2024-11-11T14:36:03Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

pruthvistony · 2024-11-18T05:31:31Z

Closing the stale PR. This change also can't be pushed to release/2.5 since it's too late.

pytorchbot requested review from eqy and syed-ahmed as code owners September 10, 2024 21:54

pytorchbot mentioned this pull request Sep 10, 2024

[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config #135397

Closed

pytorch-bot bot added the ciflow/rocm, module: rocm, and release notes: cuda labels Sep 10, 2024

pytorchbot added the open source label Sep 10, 2024

eqy approved these changes Sep 11, 2024

pruthvistony self-requested a review September 11, 2024 21:10

pruthvistony approved these changes Sep 12, 2024

hongxiayang approved these changes Sep 12, 2024

github-actions bot added the Stale label Nov 11, 2024

pruthvistony closed this Nov 18, 2024

github-actions bot deleted the cherry-pick-135397-by-pytorch_bot_bot_ branch December 19, 2024 02:08
