[Inductor][Intel GPU] Save `threads_per_warp` from Triton compiled kernel for launching kernel correctly in cpp wrapper.
#163315
Conversation
…rnel for launching kernel correctly in cpp wrapper. [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163315
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 51c2cfc with merge base ed3438f. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Title changed: "Save `threads_per_warp` from Triton compiled kernel" → "Save `threads_per_warp` from Triton compiled kernel for launching kernel correctly in cpp wrapper."
  void** params,
- sycl::queue* queuePtr) {
+ sycl::queue* queuePtr,
+ uint32_t threadsPerWarp) {
This order looks better to me.
uint32_t gridX,
uint32_t gridY,
uint32_t gridZ,
uint32_t numWarps,
uint32_t threadsPerWarp,
uint32_t sharedMemory,
void** params,
Yes, the order is better, but the code generation for these parameters is shared with CUDA, so I would prefer to keep threadsPerWarp as an extra parameter here.
Hi @jansel @desertfire, could you please take a look at this PR when you have time? We'd like to get this fix cherry-picked into the 2.9 release. Thanks!
# can launch the kernel with the correct configuration.
threads_per_warp = 32
if hasattr(launcher.bin.metadata, "threads_per_warp"):
    threads_per_warp = launcher.bin.metadata.threads_per_warp
Nit: you can drop the if-check and use getattr with 32 as the default value.
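For illustration, the getattr pattern the nit suggests can be sketched like this (KernelMetadata is a hypothetical stand-in for the Triton launcher's `launcher.bin.metadata` object, which may or may not expose `threads_per_warp`):

```python
# Hypothetical stand-in for launcher.bin.metadata.
class KernelMetadata:
    pass

meta = KernelMetadata()

# getattr with a default collapses the hasattr/if pattern into one line.
# 32 matches the CUDA convention; Intel XPU kernels may report 16.
threads_per_warp = getattr(meta, "threads_per_warp", 32)
print(threads_per_warp)  # 32: attribute absent, default used

meta.threads_per_warp = 16
threads_per_warp = getattr(meta, "threads_per_warp", 32)
print(threads_per_warp)  # 16: kernel metadata now supplies the value
```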
@desertfire: Thank you very much for your suggestion. I've simplified this piece of code into a single line.
…compiled kernel for launching kernel correctly in cpp wrapper."
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben
[ghstack-poisoned]
|
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
|
@pytorchbot cherry-pick --onto release/2.9 -c "Critical - Critical fixes to new features"
|
❌ 🤖 pytorchbot command failed: Try
|
@pytorchbot cherry-pick --onto release/2.9 -c fixnewfeature |
…rnel for launching kernel correctly in cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
Pull Request resolved: #163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
(cherry picked from commit 9f8a311)
Cherry picking #163315: the cherry pick PR is at #163388, and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated. Details for Dev Infra team: raised by workflow job.
…rnel for launching kernel correctly in cpp wrapper. (#163388)
[Inductor][Intel GPU] Save `threads_per_warp` from Triton compiled kernel for launching kernel correctly in cpp wrapper. (#163315)
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
Pull Request resolved: #163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
(cherry picked from commit 9f8a311)
Co-authored-by: xinan.lin <[email protected]>
Stack from ghstack (oldest at bottom):
Save `threads_per_warp` from Triton compiled kernel for launching kernel correctly in cpp wrapper. #163315
On the Inductor XPU backend, `threads_per_warp` is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben
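To illustrate why preserving this value matters, here is a minimal sketch of how a launch configuration might be derived from kernel metadata. The names (`launch_config`, `Meta`) are illustrative only, not PyTorch's actual codegen API; the point is that the total work-group size depends on both `num_warps` and `threads_per_warp`, so hard-coding 32 would over-size the launch for Intel kernels that use 16:

```python
# Hedged sketch (names are illustrative, not the real Inductor API):
# keep threads_per_warp from compiled-kernel metadata so the wrapper
# can compute the correct work-group size at launch time.

def launch_config(metadata, grid, num_warps):
    # XPU kernels may use a sub-group size of 16 instead of the CUDA-style 32.
    threads_per_warp = getattr(metadata, "threads_per_warp", 32)
    block_size = num_warps * threads_per_warp
    return {"grid": grid, "block_size": block_size}

class Meta:
    threads_per_warp = 16  # e.g. an Intel GEMM Triton kernel

cfg = launch_config(Meta(), grid=(64, 1, 1), num_warps=4)
print(cfg["block_size"])  # 64, not the 128 a hard-coded 32 would give
```

If the metadata were ignored and 32 assumed, the wrapper would request twice as many threads per work-group as the kernel was compiled for, which is exactly the misconfiguration this PR avoids.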