Conversation

@jgong5 (Collaborator) commented Jul 16, 2024

Stack from ghstack (oldest at bottom):

This PR provides initial support for k-slicing (i.e., parallel reduction along the K dimension) in the CPP GEMM template. Only static shapes are supported for now. When k-slicing is enabled, extra temporary buffers are allocated to hold intermediate results, and an extra barrier follows each thread's initial GEMM compute: each thread first stores its GEMM result into temporary accumulation buffers (pointed to by local_buf_ptrs, an array of pointers to the accumulation buffers), then the results are reduced along the k-slices, the epilogue is computed, and the final output Y is stored. Within each k-slicing thread group, the reduction along k-slices and the epilogue computes are parallelized along the M dimension, as sketched below. The algorithm is designed to keep synchronization overhead as low as possible.
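For illustration, here is a minimal Python sketch of the two-phase structure described above (the real implementation is C++ emitted by the CPP GEMM template; names such as `local_bufs`, `Tk`, and the ReLU epilogue are illustrative assumptions, not the template's identifiers):

```python
import threading
import numpy as np

def gemm_k_sliced(A, B, Tk=4):
    """Two-phase k-sliced GEMM: per-thread partial sums over a K slice,
    one barrier, then cross-slice reduction + epilogue parallel along M."""
    M, K = A.shape
    _, N = B.shape
    # One temporary accumulation buffer per k-slice.
    local_bufs = [np.zeros((M, N), dtype=A.dtype) for _ in range(Tk)]
    Y = np.empty((M, N), dtype=A.dtype)
    barrier = threading.Barrier(Tk)

    def worker(t):
        k0, k1 = t * K // Tk, (t + 1) * K // Tk
        # Phase 1: accumulate this thread's K slice into its private buffer.
        local_bufs[t][:] = A[:, k0:k1] @ B[k0:k1, :]
        barrier.wait()  # the single extra synchronization point
        # Phase 2: reduce across slices and apply the epilogue, parallel
        # along M -- each thread owns a disjoint set of rows.
        for m in range(t, M, Tk):
            acc = sum(buf[m] for buf in local_bufs)
            Y[m] = np.maximum(acc, 0)  # e.g. a ReLU epilogue of linear_unary

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(Tk)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return Y
```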

K-slicing is enabled only when blocking on M and N cannot occupy all threads. Since k-slicing does not always bring a benefit, an extra configuration option gates it (disabled by default; see the usage sketch below). Identifying a good heuristic for enabling k-slicing by default is left as future work.
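As a usage sketch, opting in might look like the following; the knob name `cpp.gemm_max_k_slices` is my assumption for the configuration option mentioned above and should be verified against `torch/_inductor/config.py`:

```python
import torch
import torch._inductor.config as inductor_config

# Assumed knob (verify in torch/_inductor/config.py): values > 1 permit
# splitting the reduction over K into that many slices; 1 keeps it off.
inductor_config.cpp.gemm_max_k_slices = 2
inductor_config.max_autotune = True  # let autotuning pick the CPP GEMM template

@torch.compile
def linear_relu(x, w):
    return torch.relu(x @ w)
```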

Performance numbers for 64x4096x64, 64x10000x64, and 64x20000x64 GEMMs on a 60-core SPR machine are shown below. As you can see, k-slicing only beats non-k-slicing when K is large enough.

Without k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0108 ms 100.0%
_linear_pointwise 0.0431 ms 25.1%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0272 ms 100.0%
_linear_pointwise 0.0892 ms 30.5%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0781 ms 100.0%
_linear_pointwise 0.1693 ms 46.1%

With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0260 ms 100.0%
_linear_pointwise 0.0444 ms 58.5%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0275 ms 100.0%
_linear_pointwise 0.0893 ms 30.8%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0284 ms 100.0%
_linear_pointwise 0.1686 ms 16.8%

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot bot commented Jul 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130821

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7d7befe with merge base 1614891:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jgong5 pushed a commit that referenced this pull request Jul 16, 2024

jgong5 pushed a commit that referenced this pull request Jul 18, 2024

Jiong Gong added 2 commits July 21, 2024 16:48
@jgong5 jgong5 changed the title [inductor][cpp][gemm] support k slicing [inductor][cpp][gemm] support k slicing for static shapes Jul 21, 2024
@jgong5 jgong5 marked this pull request as ready for review July 23, 2024 14:32
jgong5 pushed a commit that referenced this pull request Jul 23, 2024
@jgong5 jgong5 added the `topic: not user facing` label Jul 24, 2024
Review thread on the following hunk:

```python
    return False
if self.is_dynamic_M:
    # TODO(jgong5): perhaps use size hint to decide?
    return True
```
Collaborator

Since num_k_slices is 1 for dynamic M, k-slicing will not work for this case anyway.

@jgong5 (Collaborator, Author)

But I don't want to generate k-slicing-related code when M is dynamic, for now.

@jgong5 (Collaborator, Author) commented Jul 24, 2024

@pytorchbot merge

@pytorch-bot bot added the `ciflow/trunk` label (trigger trunk jobs on your pull request) Jul 24, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@jgong5 (Collaborator, Author) commented Jul 25, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@jgong5 (Collaborator, Author) commented Jul 25, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
