
Conversation

eqy (Collaborator) commented Sep 30, 2024

Similar to #110251, we're seeing cases where vectorization can benefit casts to fp16/bf16.
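
For reference, this is the kind of dtype-converting copy the change affects (an illustrative sketch only; the PR modifies the underlying CUDA copy kernel, not this Python-level API):

```python
import torch

# Illustration only: these dtype-converting copies are backed by the CUDA
# copy kernel this PR vectorizes, e.g. as produced by .to()/.half() or by
# autocast casting inputs down to half/bfloat16.
x = torch.randn(1 << 20, device="cuda", dtype=torch.float32)
y_fp16 = x.to(torch.float16)   # fp32 -> fp16 cast
y_bf16 = x.to(torch.bfloat16)  # fp32 -> bf16 cast
```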

cc @ptrblck @msaroufim @mcarilli @leslie-fang-intel @jgong5

eqy added labels on Sep 30, 2024: module: cuda (Related to torch.cuda, and CUDA support in general), open source, module: bfloat16, module: half (Related to float16 half-precision floats), module: amp (automated mixed precision) autocast, topic: not user facing (topic category)
eqy requested a review from syed-ahmed as a code owner on September 30, 2024 at 21:56
pytorch-bot (bot) commented Sep 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137053

Note: Links to docs will display an error until the doc builds have completed.

✅ No Failures

As of commit 8b45b71 with merge base 3f9f604:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Oct 1, 2024
drisspg (Contributor) commented Oct 1, 2024

Mind adding a quick perf sweep?

drisspg self-requested a review on October 1, 2024 at 17:18
eqy (Collaborator, Author) commented Oct 28, 2024

@drisspg sorry for the delay; here's a quick run for half on a power-limited H100 PCIe.

The improvement shows up mostly at larger sizes:
before

numel         time (s)
512           6.796240340918303e-06
2048          5.685428623110056e-06
8192          5.973570514470339e-06
32768         5.551059730350971e-06
131072        5.707070231437683e-06
524288        6.505390629172325e-06
2097152       9.025379549711942e-06
8388608       3.8858160842210054e-05
33554432      0.00014148149872198702
134217728     0.0005525871901772917
536870912     0.00222478736191988
2147483648    0.009108047320041805

after

numel         time (s)
512           7.028880063444376e-06
2048          6.492158863693476e-06
8192          6.225050892680883e-06
32768         6.103010382503271e-06
131072        5.990599747747183e-06
524288        6.801350973546505e-06
2097152       6.0079293325543405e-06
8388608       3.216813085600734e-05
33554432      0.00011356725823134183
134217728     0.0004374598595313728
536870912     0.0017357393098063768
2147483648    0.006942723749671131
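
The benchmark harness isn't included in the thread; a minimal sketch that produces a sweep of this shape (the CUDA-event timing, warmup, and iteration counts are assumptions; the sizes match the output above) might look like:

```python
import torch

def time_cast(numel, iters=100):
    """Time an fp32 -> fp16 cast of `numel` elements, in seconds per cast."""
    src = torch.randn(numel, device="cuda", dtype=torch.float32)
    # Warm up so allocation and kernel caching don't skew the measurement.
    for _ in range(10):
        src.to(torch.float16)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        src.to(torch.float16)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters / 1e3  # ms per iter -> s per cast

# Sizes 512 .. 2147483648, stepping by 4x, matching the sweep above.
for numel in [512 * 4**i for i in range(12)]:
    print(numel, time_cast(numel))
```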

eqy (Collaborator, Author) commented Oct 29, 2024

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

pytorchmergebot (Collaborator) commented

Successfully rebased ampcopy onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ampcopy && git pull --rebase)

eqy (Collaborator, Author) commented Oct 29, 2024

@pytorchmergebot merge

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Oct 29, 2024
pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 17, 2025
Similar to pytorch#110251, we're seeing cases where vectorization can benefit casts to fp16/bf16.

Pull Request resolved: pytorch#137053
Approved by: https://github.com/drisspg

Co-authored-by: eqy <[email protected]>