[inductor][cpp][gemm] support k slicing for static shapes #130821

jgong5 · 2024-07-16T08:20:23Z

Stack from ghstack (oldest at bottom):

This PR adds initial support for k-slicing (i.e. parallel reduction along the K dimension) to the CPP GEMM template. Only static shapes are supported for now. When k-slicing is enabled, extra temporary buffers are allocated to hold intermediate results, and an extra barrier is added after each thread's initial GEMM compute. Each thread first stores its GEMM result into a temporary accumulation buffer (referenced through local_buf_ptrs, an array of pointers to the accumulation buffers). This is followed by a reduction along the k-slices, the epilogue computation, and the store to the final output Y. Within each k-slicing thread group, the reduction along k-slices and the epilogue are computed in parallel along the M dimension. The algorithm is designed to keep synchronization overhead as low as possible.
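The flow described above can be illustrated with a minimal pure-Python sketch. It is not the generated C++ kernel: the function name `gemm_k_sliced`, the list-of-lists layout, and the use of `threading.Barrier` are assumptions made for illustration, and the sketch models a single k-slicing thread group that covers all of M and N. `local_bufs` stands in for the buffers that `local_buf_ptrs` points to.

```python
import threading


def gemm_k_sliced(X, W, num_k_slices, epilogue=lambda v: v):
    """Compute epilogue(X @ W) with the K dimension split across threads.

    X is M x K and W is K x N, both as lists of lists (illustrative layout).
    """
    M, K, N = len(X), len(W), len(W[0])
    # One temporary accumulation buffer per k-slice.
    local_bufs = [[[0.0] * N for _ in range(M)] for _ in range(num_k_slices)]
    Y = [[0.0] * N for _ in range(M)]
    barrier = threading.Barrier(num_k_slices)
    k_step = -(-K // num_k_slices)  # ceiling division
    m_step = -(-M // num_k_slices)

    def worker(tid):
        # Phase 1: partial GEMM over this thread's k-slice into its own buffer.
        buf = local_bufs[tid]
        for k in range(tid * k_step, min((tid + 1) * k_step, K)):
            for m in range(M):
                x = X[m][k]
                for n in range(N):
                    buf[m][n] += x * W[k][n]
        # The one extra barrier: every partial result must be complete.
        barrier.wait()
        # Phase 2: reduce across k-slices and apply the epilogue,
        # with the rows of M divided among the threads of the group.
        for m in range(tid * m_step, min((tid + 1) * m_step, M)):
            for n in range(N):
                acc = sum(local_bufs[s][m][n] for s in range(num_k_slices))
                Y[m][n] = epilogue(acc)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_k_slices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return Y


X = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
Y = gemm_k_sliced(X, W, num_k_slices=2, epilogue=lambda v: max(v, 0.0))
```

Each thread writes only its own buffer before the barrier and only its own rows of Y after it, so a single barrier is the only synchronization point in this sketch.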

K-slicing is applied when blocking on M and N cannot occupy all threads. Since k-slicing doesn't always bring a benefit, an extra configuration option is added to enable it (disabled by default). We need to identify a good heuristic in the future so that k-slicing can be enabled by default.
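As a rough illustration of that condition, the sketch below decides how many k-slices to use from the number of M and N blocks and the thread count. The helper `choose_num_k_slices`, its parameters, and the integer-division rule are hypothetical and are not taken from the PR; the actual decision logic in the template may differ.

```python
def choose_num_k_slices(m_blocks, n_blocks, num_threads, k_slicing_enabled=False):
    """Hypothetical helper: number of k-slices per (M, N) block.

    Returns 1 (no k-slicing) when the option is off or when the M/N
    blocks alone already provide at least one block per thread.
    """
    mn_blocks = m_blocks * n_blocks
    if not k_slicing_enabled or mn_blocks >= num_threads:
        return 1
    # Spread the otherwise idle threads over the K dimension.
    return num_threads // mn_blocks


# Illustrative block counts only: a 2 x 2 grid of output blocks on 60 threads.
slices_off = choose_num_k_slices(2, 2, 60)
slices_on = choose_num_k_slices(2, 2, 60, k_slicing_enabled=True)
```

With the option left at its default the helper always returns 1, which mirrors the opt-in behaviour described above.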

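Performance numbers for 64x4096x64, 64x10000x64 and 64x20000x64 (MxKxN) as examples on a 60-core SPR. As the numbers show, k-slicing only outperforms non-k-slicing when K is large enough: it is slower at K=4096 (0.0260 ms vs. 0.0108 ms), roughly on par at K=10000 (0.0275 ms vs. 0.0272 ms), and faster at K=20000 (0.0284 ms vs. 0.0781 ms).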

Without k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0108 ms 100.0%
_linear_pointwise 0.0431 ms 25.1%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0272 ms 100.0%
_linear_pointwise 0.0892 ms 30.5%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0781 ms 100.0%
_linear_pointwise 0.1693 ms 46.1%

With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0260 ms 100.0%
_linear_pointwise 0.0444 ms 58.5%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0275 ms 100.0%
_linear_pointwise 0.0893 ms 30.8%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0284 ms 100.0%
_linear_pointwise 0.1686 ms 16.8%
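
To make the comparison explicit, the cpp_packed_gemm_0 timings above can be turned into speedup ratios (time without k-slicing divided by time with k-slicing). The variable names are chosen for illustration; the numbers are copied from the autotune logs above.

```python
# K -> (ms without k-slicing, ms with k-slicing) for cpp_packed_gemm_0
timings_ms = {
    4096: (0.0108, 0.0260),
    10000: (0.0272, 0.0275),
    20000: (0.0781, 0.0284),
}

# A ratio above 1 means k-slicing is faster.
speedups = {k: without / with_ks for k, (without, with_ks) in timings_ms.items()}

for k, s in speedups.items():
    print(f"K={k}: {s:.2f}x")
```

The ratios come out at roughly 0.42x, 0.99x and 2.75x, so in these measurements k-slicing only pays off at K=20000.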

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-07-16T08:20:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130821


✅ No Failures

As of commit 7d7befe with merge base 1614891 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


leslie-fang-intel · 2024-07-24T02:28:50Z

torch/_inductor/codegen/cpp_gemm_template.py

+            return False
+        if self.is_dynamic_M:
+            # TODO(jgong5): perhaps use size hint to decide?
+            return True


Since num_k_slices is 1 for dynamic M, k-slicing will not take effect in this case anyway.

But I don't want to generate k-slicing-related code for dynamic M for now.

jgong5 · 2024-07-24T22:17:12Z

@pytorchbot merge

pytorchmergebot · 2024-07-24T22:19:12Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-07-25T04:17:50Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jgong5 · 2024-07-25T06:51:27Z

@pytorchbot merge

pytorchmergebot · 2024-07-25T06:54:07Z

Merge started


pytorchmergebot · 2024-07-25T12:52:45Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jgong5 · 2024-07-25T13:34:43Z

@pytorchbot merge

pytorchmergebot · 2024-07-25T13:36:26Z

Merge started


Update

bc3dabc

[ghstack-poisoned]

jgong5 mentioned this pull request Jul 16, 2024

[inductor][cpp] align dtype convert cache between vec and scalar kernels #130677

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Jul 16, 2024

This was referenced Jul 16, 2024

[inductor][cpp][gemm] move bias add to epilogue #130675

Closed

[inductor][cpp][gemm] optimize arbitrary N in packed gemm template #130690

Closed

jgong5 marked this pull request as draft July 16, 2024 08:20

Update

81ca176

[ghstack-poisoned]

jgong5 pushed a commit that referenced this pull request Jul 16, 2024

[inductor][cpp][gemm] support k slicing

174b76f

ghstack-source-id: fa224b6 Pull Request resolved: #130821

pytorchbot added the open source label Jul 16, 2024

Update

f6be4ee

[ghstack-poisoned]

This was referenced Jul 18, 2024

[inductor] [cpp] improve cache blocking with CPU info #129348

Closed

[inductor][cpp][gemm] improve thread blocking heuristics #131024

Closed

jgong5 pushed a commit that referenced this pull request Jul 18, 2024

[inductor][cpp][gemm] support k slicing

6548c70

ghstack-source-id: 3ccfa1f Pull Request resolved: #130821

Update

3a2170a

[ghstack-poisoned]

Update

581d57f

[ghstack-poisoned]

Jiong Gong added 2 commits July 21, 2024 16:48

Update on "[inductor][cpp][gemm] support k slicing"

c2d14af

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

Update on "[inductor][cpp][gemm] support k slicing for static shapes"

81ad296

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

jgong5 changed the title ~~[inductor][cpp][gemm] support k slicing~~ [inductor][cpp][gemm] support k slicing for static shapes Jul 21, 2024

Update

3551da5

[ghstack-poisoned]

Update

8073bd3

[ghstack-poisoned]

Update

1ca39b6

[ghstack-poisoned]

Update

297d4a7

[ghstack-poisoned]

jgong5 marked this pull request as ready for review July 23, 2024 14:32

jgong5 requested review from chunyuan-w, jansel and leslie-fang-intel July 23, 2024 14:57

Update

7d7befe

[ghstack-poisoned]

jgong5 pushed a commit that referenced this pull request Jul 23, 2024

[inductor][cpp][gemm] support k slicing for static shapes

6c1f313

ghstack-source-id: 215a9fb Pull Request resolved: #130821

jgong5 added the topic: not user facing topic category label Jul 24, 2024

leslie-fang-intel reviewed Jul 24, 2024

View reviewed changes

leslie-fang-intel approved these changes Jul 24, 2024

View reviewed changes

jansel approved these changes Jul 24, 2024

View reviewed changes

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 24, 2024

pytorchmergebot added the merging label Jul 24, 2024

pytorchmergebot added the Merged label Jul 25, 2024

pytorchmergebot closed this in 316c0d3 Jul 25, 2024

pytorchmergebot removed the merging label Jul 25, 2024

henrylhtsang mentioned this pull request Jul 31, 2024

[BE][typing] fix types in common pruning #132309

Closed

jgong5 mentioned this pull request Aug 24, 2024

[RFC] Add Cpp Template for GEMM related ops via max-autotune for Inductor CPU #125683

Open

18 tasks

github-actions bot deleted the gh/jgong5/60/head branch August 25, 2024 02:02
