support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F #154680
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154680
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure. As of commit 4b651a9 with merge base 31405a6, the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Cherry-pick of upstream pytorch#154680.
@pytorchbot rebase
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Requires CUDA >= 12.9 and sm_90. hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
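For context, here is a minimal runtime sketch of the same gating conditions (CUDA >= 12.9 and an sm_90 device). The helper name and the check itself are hypothetical illustrations, not code from this PR; the actual build-time gate is the CMake test mentioned above.

```python
# Hypothetical helper (not from this PR) mirroring the gating described above:
# the cuBLASLt outer-vector-scale path needs CUDA >= 12.9 and an sm_90 device.
import torch


def outer_vec_scale_probably_available() -> bool:  # hypothetical name
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    major, minor = (int(x) for x in torch.version.cuda.split(".")[:2])
    if (major, minor) < (12, 9):
        return False
    # sm_90 only, per the PR description.
    return torch.cuda.get_device_capability() == (9, 0)


if __name__ == "__main__":
    print(outer_vec_scale_probably_available())
```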
Successfully rebased (dab977a to 4b651a9).
@malfet do you need to reimport after the rebase?
@pytorchbot merge -f "unrelated failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.

I tested this change by rebuilding PyTorch locally with CUDA 12.9 and running `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is what cuBLAS calls this scaling mode).

I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU, using the same approach as in #134781, and obtained these speed-ups:

[benchmark plots comparing the cuBLAS and CUTLASS kernels]

We see that the two kernels perform very closely (I'm surprised; I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better. I guess the questions are whether we consider this a net-zero change (given that there are improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.

Pull Request resolved: #157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
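For reference, a minimal sketch (not taken from the PR) of the kind of profiling run described above: driving `torch._scaled_mm` with row-wise ("outer vector") scales and printing which GEMM kernel was launched. The shapes, dtypes, and profiler setup are arbitrary assumptions for illustration.

```python
# Sketch: run torch._scaled_mm with row-wise scales under the profiler and
# inspect which GEMM kernel gets launched. Shapes/dtypes are illustrative.
import torch
from torch.profiler import profile, ProfilerActivity

M, K, N = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major B
scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)    # one scale per row of A
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)    # one scale per column of B

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                           out_dtype=torch.bfloat16)

# On a build that meets the requirements (CUDA >= 12.9, sm_90), the top kernel
# should be a cuBLAS "nvjet...ovscale" kernel rather than a CUTLASS rowwise one.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```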