
Conversation

@etaf etaf commented Sep 19, 2025

Stack from ghstack (oldest at bottom):

On the Inductor XPU backend, threads_per_warp is not always 32. For Intel GEMM Triton kernels, it can be 16. This information must be preserved for XPU so that the Cpp wrapper can launch the kernel with the correct configuration.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben
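To make the failure mode concrete, here is a minimal sketch (not the actual Inductor code; the function name is illustrative) of why the cpp wrapper must know threads_per_warp when sizing the launch:

```python
# Hedged sketch: the work-group size for a Triton kernel launch is
# num_warps * threads_per_warp. Hard-coding the CUDA-style default of 32
# produces the wrong size for Intel GEMM kernels compiled with 16.
def workgroup_size(num_warps: int, threads_per_warp: int = 32) -> int:
    return num_warps * threads_per_warp

# Example: 8 warps on an Intel GEMM kernel (threads_per_warp == 16)
print(workgroup_size(8, 16))  # 128, the intended launch size
print(workgroup_size(8))      # 256, wrong if the default 32 is assumed
```

This is why the value recorded by the Triton compiler has to be carried through to the wrapper rather than assumed.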

…rnel

for launching kernel correctly in cpp wrapper.

[ghstack-poisoned]

pytorch-bot bot commented Sep 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163315

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 51c2cfc with merge base ed3438f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

etaf added a commit that referenced this pull request Sep 19, 2025

ghstack-source-id: 54b133b
Pull Request resolved: #163315
@etaf etaf added the ciflow/xpu Run XPU CI tasks label Sep 19, 2025
@etaf etaf changed the title from "[Inductor][Intel GPU] Save threads_per_warp from Triton compiled kernel" to "[Inductor][Intel GPU] Save threads_per_warp from Triton compiled kernel for launching kernel correctly in cpp wrapper." Sep 19, 2025
    void** params,
-   sycl::queue* queuePtr) {
+   sycl::queue* queuePtr,
+   uint32_t threadsPerWarp) {
Contributor:

This order looks better to me.

uint32_t gridX,
uint32_t gridY,
uint32_t gridZ,
uint32_t numWarps,
uint32_t threadsPerWarp,
uint32_t sharedMemory,
void** params,

Collaborator Author:

Yes, the order is better, but the code generation of these parameters is shared with CUDA, so I would prefer to keep threadsPerWarp as an extra parameter here.

@etaf etaf added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 19, 2025
@etaf etaf requested review from desertfire and jansel September 19, 2025 08:40

etaf commented Sep 19, 2025

Hi @jansel @desertfire, could you please take a look at this PR when you have time? We’d like to get this fix cherry-picked into the 2.9 release. Thanks!

# can launch the kernel with the correct configuration.
threads_per_warp = 32
if hasattr(launcher.bin.metadata, "threads_per_warp"):
threads_per_warp = launcher.bin.metadata.threads_per_warp
Contributor:

Nit: you can drop the if-check and use getattr with 32 as the default value.

Collaborator Author:

@desertfire : Thank you very much for your suggestion. I’ve simplified this piece of code into a single line.
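The suggested simplification can be sketched as follows; `launcher` here is a hand-built stand-in for the real Triton launcher object, not PyTorch's actual class:

```python
from types import SimpleNamespace

# Stand-in for launcher.bin.metadata as produced by the Triton compiler;
# an XPU GEMM kernel may carry threads_per_warp == 16.
launcher = SimpleNamespace(
    bin=SimpleNamespace(metadata=SimpleNamespace(threads_per_warp=16))
)

# Instead of:
#     threads_per_warp = 32
#     if hasattr(launcher.bin.metadata, "threads_per_warp"):
#         threads_per_warp = launcher.bin.metadata.threads_per_warp
# a single getattr with 32 as the fallback does the same thing:
threads_per_warp = getattr(launcher.bin.metadata, "threads_per_warp", 32)
print(threads_per_warp)  # 16 here; 32 when the metadata lacks the attribute
```

The three-argument form of `getattr` returns the default when the attribute is absent, so the behavior is identical to the hasattr branch.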

…compiled kernel for launching kernel correctly in cpp wrapper."



[ghstack-poisoned]
etaf added a commit that referenced this pull request Sep 19, 2025

ghstack-source-id: c56d814
Pull Request resolved: #163315
@EikanWang (Collaborator):
@pytorchbot merge

@pytorchmergebot:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.


etaf commented Sep 19, 2025

@pytorchbot cherry-pick --onto release/2.9 -c "Critical - Critical fixes to new features"


pytorch-bot bot commented Sep 19, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot cherry-pick: error: argument -c/--classification: invalid choice: 'Critical - Critical fixes to new features' (choose from 'regression', 'critical', 'fixnewfeature', 'docs', 'release')

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Try @pytorchbot --help for more info.


etaf commented Sep 19, 2025

@pytorchbot cherry-pick --onto release/2.9 -c fixnewfeature

pytorchbot pushed a commit that referenced this pull request Sep 20, 2025
…rnel for launching kernel correctly in cpp wrapper. (#163315)


Pull Request resolved: #163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire

(cherry picked from commit 9f8a311)
@pytorchbot:
Cherry picking #163315

The cherry pick PR is at #163388 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:


mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
…rnel for launching kernel correctly in cpp wrapper. (pytorch#163315)


Pull Request resolved: pytorch#163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
…rnel for launching kernel correctly in cpp wrapper. (pytorch#163315)


Pull Request resolved: pytorch#163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
atalman pushed a commit that referenced this pull request Sep 26, 2025
…rnel for launching kernel correctly in cpp wrapper. (#163388)

[Inductor][Intel GPU] Save `threads_per_warp` from Triton compiled kernel for launching kernel correctly in cpp wrapper. (#163315)


Pull Request resolved: #163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire

(cherry picked from commit 9f8a311)

Co-authored-by: xinan.lin <[email protected]>
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…rnel for launching kernel correctly in cpp wrapper. (pytorch#163315)


Pull Request resolved: pytorch#163315
Approved by: https://github.com/EikanWang, https://github.com/desertfire
@github-actions github-actions bot deleted the gh/etaf/170/head branch October 20, 2025 02:17