
Conversation

@jeffdaily (Collaborator)

Requires CUDA >= 12.9 and sm_90.

hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
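
A cmake test of that kind can be sketched with CMake's `check_cxx_source_compiles`. This is illustrative only, not the PR's actual code: the result variable `HIPBLASLT_OUTER_VEC_SUPPORTED`, the compile definition, and the exact hipBLASLt enum spelling below are my assumptions.

```cmake
include(CheckCXXSourceCompiles)

# Probe whether the installed hipBLASLt headers already expose the
# new outer-vector scaling enum (expected in ROCm 7.0).
set(CMAKE_REQUIRED_INCLUDES ${HIPBLASLT_INCLUDE_DIR})
check_cxx_source_compiles("
  #include <hipblaslt/hipblaslt.h>
  int main() {
    auto mode = HIPBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F;
    (void)mode;
    return 0;
  }
" HIPBLASLT_OUTER_VEC_SUPPORTED)

if(HIPBLASLT_OUTER_VEC_SUPPORTED)
  add_compile_definitions(HIPBLASLT_OUTER_VEC)
endif()
```

Guarding the C++ code on the resulting definition lets the enum be used before ROCm 7.0 is the minimum supported version.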

pytorch-bot bot commented May 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154680

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 4b651a9 with merge base 31405a6:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily added the release notes: rocm, release notes: cuda, and ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300) labels May 29, 2025
@jeffdaily jeffdaily marked this pull request as ready for review June 2, 2025 21:11
pruthvistony pushed 3 commits to ROCm/pytorch that referenced this pull request Jun 3, 2025
@pruthvistony (Collaborator)

@pytorchbot rebase

@facebook-github-bot (Contributor)

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jun 3, 2025
@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

jeffdaily added 5 commits June 3, 2025 22:00
@pytorchmergebot (Collaborator)

Successfully rebased blaslt_matmul_matrix_scale_outer_vec_32f onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via `git checkout blaslt_matmul_matrix_scale_outer_vec_32f && git pull --rebase`)

@pytorchmergebot pytorchmergebot force-pushed the blaslt_matmul_matrix_scale_outer_vec_32f branch from dab977a to 4b651a9 Compare June 3, 2025 22:00
@jeffdaily (Collaborator, Author)

@malfet you need to reimport after the rebase?

@jeffdaily (Collaborator, Author)

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. That option allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Jul 11, 2025
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.

I tested this change by rebuilding PyTorch locally with CUDA 12.9 and running `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is cuBLAS's name for this scaling mode).
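
For readers unfamiliar with the mode: as I understand it, outer-vector scaling means the left operand carries one FP32 scale per row and the right operand one FP32 scale per column, so the result is C[i][j] = a_scale[i] * b_scale[j] * (A @ B)[i][j]. A tiny pure-Python sketch of the arithmetic only (the function and variable names are mine; this is not the cuBLAS or `torch._scaled_mm` API, and it ignores FP8 quantization entirely):

```python
def scaled_mm(A, B, a_scale, b_scale):
    """Reference semantics for outer-vector scaling: one FP32 scale
    per row of A and one per column of B, applied to the plain
    matmul result. A is M x K, B is K x N, lists of lists."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = sum(A[i][k] * B[k][j] for k in range(K))
            out[i][j] = a_scale[i] * b_scale[j] * acc
    return out

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B == A
# rows of A scaled by 2.0 and 0.5, columns by 1.0 and 3.0
print(scaled_mm(A, B, a_scale=[2.0, 0.5], b_scale=[1.0, 3.0]))
# → [[2.0, 12.0], [1.5, 6.0]]
```

This per-row/per-column granularity is what distinguishes the mode from the older tensor-wide (single scalar) scaling.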

I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU. I used the same approach as in #134781, and obtained these speed-ups:
![image](https://github.com/user-attachments/assets/43dfb816-9ccf-40c5-8b2a-571ce9cb511d)
![image](https://github.com/user-attachments/assets/be7ac6f2-e16c-479b-ad5c-f8039caba4b1)

We see that the two kernels perform very closely (I'm surprised, I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better.

I guess the questions are whether we consider this a net-zero change (given that there's improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.

Pull Request resolved: #157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
