[inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686) #135438
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135438
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 780061e with merge base 042f2f7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
- for (int64_t mc = m_block_start; mc < m_block_end; mc += Mc_blocks) {
+ for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; mc_block_id++) {
+     const int64_t my_mc_block_id = (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
+     const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
This is the core part of the change: it allows different cores that share the same M blocks to load different chunks at any given time, mitigating the memory synchronization cost.
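For illustration, here is a minimal standalone sketch of that staggering scheme outside the template (the values of `num_Mc_blocks_per_thread`, `m_block_start`, `Mc_blocks`, and the number of N slices below are made up; in the real kernel they come from the generated GEMM code). It just prints the M-block visiting order per N slice, showing that slices sharing the same M blocks start from different offsets:

```cpp
// Sketch of the staggered M-block ordering. The constants are hypothetical;
// in the generated kernel they are template parameters of the GEMM.
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t m_block_start = 0;
    const int64_t Mc_blocks = 4;                // M blocks advanced per step
    const int64_t num_Mc_blocks_per_thread = 8; // M blocks owned by each thread group
    const int64_t num_n_slices = 4;             // threads sharing the same M blocks

    // Each N slice starts at a different M block, so at any given time the
    // slices are pulling different chunks of the shared input into cache.
    for (int64_t n_slice_id = 0; n_slice_id < num_n_slices; ++n_slice_id) {
        std::printf("n_slice %lld visits mc:", (long long)n_slice_id);
        for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; ++mc_block_id) {
            const int64_t my_mc_block_id =
                (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
            const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
            std::printf(" %lld", (long long)mc);
        }
        std::printf("\n");
    }
    return 0;
}
```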
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Fix #134686.
PR #132729 makes the GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
cpp_packed_gemm_2 2.9371 ms 100.0%
_linear_pointwise 3.1584 ms 93.0%
But it is slower than ATen in the e2e run due to different cache behavior. Access to the input data (12544x3072) is LLC-latency bound, and bottlenecks are seen due to memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by having the processors that share the input data cooperatively load different chunks of it.
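As a rough conceptual sketch of that cooperative loading (not the generated Inductor kernel; the OpenMP thread layout and names below are hypothetical): threads are split along N, so several threads read the same M rows of the shared input, and rotating each thread's starting M block by its slice id spreads the initial loads across the LLC instead of having every slice fetch the same chunk at once.

```cpp
// Conceptual sketch only: threads split along N share the same M rows of A.
// Staggering each thread's starting M block by its slice id means the slices
// touch disjoint chunks of A at any moment and reuse what others have already
// brought into the shared LLC.
#include <omp.h>
#include <cstdint>
#include <vector>

void gemm_like_sweep(const std::vector<float>& A, int64_t M_blocks, int64_t N_slices) {
    #pragma omp parallel num_threads(static_cast<int>(N_slices))
    {
        const int64_t n_slice_id = omp_get_thread_num();
        for (int64_t i = 0; i < M_blocks; ++i) {
            const int64_t mb = (i + n_slice_id) % M_blocks;  // staggered start per slice
            // ... compute on M block `mb` of A for this thread's N slice ...
            (void)A; (void)mb;
        }
    }
}
```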
cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang