
Conversation

@fduwjj (Contributor) commented on Nov 18, 2024

Stack from ghstack (oldest at bottom):

We added `CudaEventCache` in #133727. This feature reuses CUDA events so that we never call `cudaEventDestroy`, which has caused hangs in the past. We already have a broad set of tests, plus testing on TorchTitan and internal workloads, and no errors or crashes have been found so far, so we have decided to roll the feature out to all OSS users. Internal workloads are not affected by this PR because of internal gating.

We have also observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
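
For context, the core idea behind `CudaEventCache` is to recycle CUDA events through a free list instead of destroying them. The sketch below is a hypothetical illustration of that pattern, not the actual c10d code; the class name, members, and locking scheme are assumptions made for the example.

```cpp
// Hypothetical sketch of the event-reuse pattern (not the c10d implementation).
// Events are handed out from a free list and, when the caller is done, the
// shared_ptr's custom deleter parks the event back in the list instead of
// calling cudaEventDestroy, which is the call that has caused hangs.
#include <cuda_runtime.h>

#include <deque>
#include <memory>
#include <mutex>

class EventCacheSketch {
 public:
  std::shared_ptr<cudaEvent_t> create() {
    cudaEvent_t event{};  // cudaEvent_t is an opaque pointer; value-init is null
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!freeList_.empty()) {
        event = freeList_.front();  // reuse a previously returned event
        freeList_.pop_front();
      }
    }
    if (event == nullptr) {
      // Cache miss: create a new event once; it will be reused from now on.
      cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
    }
    // The deleter recycles the event instead of destroying it.
    return std::shared_ptr<cudaEvent_t>(
        new cudaEvent_t(event), [this](cudaEvent_t* e) {
          std::lock_guard<std::mutex> lock(mutex_);
          freeList_.push_back(*e);
          delete e;
        });
  }

 private:
  std::mutex mutex_;
  std::deque<cudaEvent_t> freeList_;
};
```

Because a pooled event is only ever re-recorded with `cudaEventRecord` and never destroyed on the hot path, the problematic `cudaEventDestroy` call is avoided entirely.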

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o

fduwjj added a commit that referenced this pull request Nov 18, 2024
ghstack-source-id: a6f2343
Pull Request resolved: #140975
@pytorch-bot added the labels `oncall: distributed` (add this issue/PR to the distributed oncall triage queue) and `release notes: distributed (c10d)` on Nov 18, 2024
@pytorch-bot (bot) commented on Nov 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140975

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 88fa03c with merge base d0fd42e:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy (Collaborator) left a comment


Failures look real; perhaps there's an issue with the cache being shared across streams/devices?

@fduwjj (Contributor, Author) commented on Nov 18, 2024

@eqy I cannot repro it locally...

@kwen2501 (Collaborator) left a comment


Approving.
As discussed, a fix for the CI failure would be to create one cache per device index seen, since DDP still supports multi-device modules (though we may start deprecating that).
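
A minimal sketch of that suggestion, reusing the hypothetical `EventCacheSketch` type from the earlier example: keep one cache per CUDA device index so that events created on one device are never recycled onto another. The container and accessor names are assumptions for illustration, not the code this PR actually landed.

```cpp
// Hypothetical per-device-index wrapper around the earlier EventCacheSketch.
// Each device index seen gets its own cache, so multi-device modules (e.g.
// under DDP) never share or recycle events across devices.
#include <map>
#include <mutex>

class PerDeviceEventCaches {
 public:
  EventCacheSketch& get(int deviceIndex) {
    std::lock_guard<std::mutex> lock(mutex_);
    // operator[] default-constructs a cache the first time a device is seen;
    // std::map nodes are stable, so the returned reference stays valid.
    return caches_[deviceIndex];
  }

 private:
  std::mutex mutex_;
  std::map<int, EventCacheSketch> caches_;
};
```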

@fduwjj added the `ciflow/trunk` label (trigger trunk jobs on your pull request) on Nov 19, 2024
fduwjj added a commit that referenced this pull request Nov 19, 2024
ghstack-source-id: a5d0a4d
Pull Request resolved: #140975
fduwjj added a commit that referenced this pull request Nov 20, 2024
ghstack-source-id: 28d0988
Pull Request resolved: #140975
@fduwjj changed the title from "[c10d] Enable CudaEventCache by default" to "[c10d] Enable CudaEventCache by default and add multi device support" on Nov 20, 2024
…ce support"


We added `CudaEventCache` in #133727 and this is a feature which tries to reuse CudaEvent so that we don't call destroy of CudaEvent which causes hang in the past. We had a bunch of tests and testing on TorchTitan and internal workload already. So far no errors or crash are found at the moment so we decide to roll out to all OSS users. For internal workload, this PR would not affect it because of some internal gating.

Also we observed some multi-device use cases in OSS, so that we want to bring back multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

cc H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Nov 21, 2024
ghstack-source-id: 7286a70
Pull Request resolved: #140975
…ce support"


We added `CudaEventCache` in #133727 and this is a feature which tries to reuse CudaEvent so that we don't call destroy of CudaEvent which causes hang in the past. We had a bunch of tests and testing on TorchTitan and internal workload already. So far no errors or crash are found at the moment so we decide to roll out to all OSS users. For internal workload, this PR would not affect it because of some internal gating.

Also we observed some multi-device use cases in OSS, so that we want to bring back multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

cc H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
…ce support"


We added `CudaEventCache` in #133727 and this is a feature which tries to reuse CudaEvent so that we don't call destroy of CudaEvent which causes hang in the past. We had a bunch of tests and testing on TorchTitan and internal workload already. So far no errors or crash are found at the moment so we decide to roll out to all OSS users. For internal workload, this PR would not affect it because of some internal gating.

Also we observed some multi-device use cases in OSS, so that we want to bring back multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

cc H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
…ce support"


We added `CudaEventCache` in #133727 and this is a feature which tries to reuse CudaEvent so that we don't call destroy of CudaEvent which causes hang in the past. We had a bunch of tests and testing on TorchTitan and internal workload already. So far no errors or crash are found at the moment so we decide to roll out to all OSS users. For internal workload, this PR would not affect it because of some internal gating.

Also we observed some multi-device use cases in OSS, so that we want to bring back multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

cc H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@fduwjj (Contributor, Author) commented on Nov 25, 2024

@pytorchbot rebase

@pytorchmergebot (Collaborator) commented:
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented:
Successfully rebased gh/fduwjj/156/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/140975)

pytorchmergebot pushed a commit that referenced this pull request Nov 25, 2024
ghstack-source-id: 87c4b25
Pull Request resolved: #140975
@fduwjj (Contributor, Author) commented on Nov 26, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
[c10d] Enable CudaEventCache by default and add multi device support (pytorch#140975)

Pull Request resolved: pytorch#140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
@github-actions github-actions bot deleted the gh/fduwjj/156/head branch December 27, 2024 02:06
