Conversation
@chunyuan-w commented Jun 25, 2024

Stack from ghstack (oldest at bottom):

Use the non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.
Verified with AMP, static and dynamic shapes, on a CPU with AMX support: no obvious end-to-end performance boost (and no regression either). We expect a performance gain once #129348 (also in this ghstack) is added on top of this PR.
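For intuition, here is a minimal Python sketch of why A is the streaming operand. The loop order and blocking factors below are illustrative assumptions, not taken from the actual Inductor AMX codegen: in an inner loop that walks down M with a fixed B tile, every A tile is touched exactly once per pass while the same B tile is reused on every iteration, so loading A non-temporally keeps it from evicting the reusable B data out of L1.

```python
from collections import Counter

def count_tile_loads(M_blocks, N_blocks, K_blocks):
    """Count how often each A and B tile is touched in a GEMM micro-kernel.

    Illustrative loop order (an assumption, not the real codegen):
        for n: for k: for m: C[m][n] += A[m][k] @ B[k][n]
    """
    a_loads, b_loads = Counter(), Counter()
    for n in range(N_blocks):
        for k in range(K_blocks):
            for m in range(M_blocks):
                # A tile (m, k): a different tile on every inner iteration,
                # i.e. streamed with no immediate reuse -> a non-temporal
                # load avoids polluting L1 with it.
                a_loads[(m, k)] += 1
                # B tile (k, n): identical across the whole m loop ->
                # worth keeping resident in L1 and reusing.
                b_loads[(k, n)] += 1
    return a_loads, b_loads

a, b = count_tile_loads(M_blocks=8, N_blocks=2, K_blocks=4)
```

With 8 M-blocks, each B tile is read 8 times in a row while a fresh A tile arrives on every inner iteration; that reuse asymmetry is what choosing `_tile_stream_loadd` for A exploits.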

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot bot commented Jun 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129455

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 415b882 with merge base cb2bce9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.


**TODOs:**
- [ ] Collect benchmark data before and after this change


[ghstack-poisoned]
chunyuan-w added a commit that referenced this pull request Jun 25, 2024
ghstack-source-id: 0089b94
Pull Request resolved: #129455
@jgong5 left a comment:


Share perf numbers?

chunyuan-w added a commit that referenced this pull request Jun 26, 2024
ghstack-source-id: 040bbba
Pull Request resolved: #129455
Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.

## Performance data
Performance speedups of >=5% (BF16 AMP), with this PR vs. without, measured on a CPU with AMX support:
- Static shapes, single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | timm_vision_transformer | 5% |
| huggingface | MT5ForConditionalGeneration | 5% |
| huggingface | MobileBertForMaskedLM | 5% |
| timm_models | gmixer_24_224 | 5% |

No perf regressions.

TODO: collect benchmarks for:
- Static shapes, multi-threaded
- Dynamic shapes, single-threaded
- Dynamic shapes, multi-threaded


[ghstack-poisoned]
@chunyuan-w (author) commented:

> Share perf numbers?

Updated the performance status in the PR description:
No obvious end-to-end performance boost (and no regression either) from switching to the non-temporal load alone. We expect more performance gain after tuning the cache blocking (#129348, also in this ghstack) on top of this PR.

@chunyuan-w chunyuan-w marked this pull request as ready for review July 12, 2024 08:55
@chunyuan-w chunyuan-w requested a review from jgong5 July 12, 2024 08:55
@chunyuan-w (author) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 15, 2024
@pytorchmergebot commented:

Merge failed

Reason: This PR needs a `release notes:` label.
If your changes are user facing and intended to be part of the release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@chunyuan-w (author) commented:

@pytorchbot merge

@pytorchmergebot commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


francograndegmailcom pushed a commit to francograndegmailcom/pytorch-pytorch that referenced this pull request Jul 23, 2024
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Jul 25, 2024
Pull Request resolved: pytorch#129455
Approved by: https://github.com/jgong5
@github-actions github-actions bot deleted the gh/chunyuan-w/19/head branch August 15, 2024 01:55