[c10d] Land CudaEventCache with roll out flags #133727
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133727
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 47c8e05 with merge base e1b9b89. This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @eqy: any potential issues with long-term reuse of CUDA events? One motivation for this PR is to avoid the case where ~CudaEvent causes a hang.
zdevito added a cache for CudaEvent in #122732, and this PR productionizes it behind a rollout flag. cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
kwen2501 left a comment:
LGTM. Added two minor comments.
    const std::optional<std::vector<at::Tensor>>& inputs,
    bool desyncDebug,
    bool enableTiming,
    bool cudaEventCacheEnabled,
nit: maybe we can make WorkNCCL a friend of ProcessGroupNCCL so that it can access ProcessGroupNCCL::cudaEventCacheEnabled_, and we don't have to pass every flag to the constructor of WorkNCCL?
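For illustration, a minimal sketch of what this suggestion could look like; the class layout and member placement below are assumptions for the sketch, not the actual ProcessGroupNCCL.hpp definitions:

```cpp
// Illustrative only: let WorkNCCL read the process group's private flag
// directly instead of threading it through the constructor.
class ProcessGroupNCCL {
  // Grant WorkNCCL access to private members such as cudaEventCacheEnabled_.
  friend class WorkNCCL;

 private:
  bool cudaEventCacheEnabled_ = false;  // populated from the rollout flag
};

class WorkNCCL {
 public:
  explicit WorkNCCL(const ProcessGroupNCCL& pg)
      : useEventCache_(pg.cudaEventCacheEnabled_) {}  // OK: WorkNCCL is a friend

 private:
  bool useEventCache_;
};
```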
std::shared_ptr<at::cuda::CUDAEvent> ProcessGroupNCCL::CUDAEventCache::create(
    bool timing) {
  auto deleter = [this, timing](at::cuda::CUDAEvent* event) {
    std::lock_guard<std::mutex> lock(this->cacheMutex_);
    this->eventsArray_[timing ? 1 : 0].push_back(event);
  };
  at::cuda::CUDAEvent* event = nullptr;
  {
    std::lock_guard<std::mutex> lock(cacheMutex_);
    auto& events = eventsArray_[timing ? 1 : 0];
    if (!events.empty()) {
      event = events.back();
      events.pop_back();
    }
  }
  if (!event) {
    event = new at::cuda::CUDAEvent(
        timing ? cudaEventDefault : cudaEventDisableTiming);
  }
  return std::shared_ptr<at::cuda::CUDAEvent>(event, std::move(deleter));
}

ProcessGroupNCCL::CUDAEventCache& ProcessGroupNCCL::CUDAEventCache::get() {
  static ProcessGroupNCCL::CUDAEventCache cache;
  return cache;
}
nit: can you add some comments for this block of code? Thanks!
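For reference, a minimal usage sketch of the cache quoted above: only CUDAEventCache::get() and create() come from the diff; the wrapping function, header paths, and stream handling are illustrative assumptions, not the actual WorkNCCL code.

```cpp
// Illustrative sketch: borrow an event from the singleton cache and let the
// shared_ptr's custom deleter return it instead of destroying the CUDA event.
#include <memory>

#include <ATen/cuda/CUDAEvent.h>
#include <torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp>

std::shared_ptr<at::cuda::CUDAEvent> recordStartEvent(bool enableTiming) {
  // Reuse a cached event if available, otherwise a new one is allocated.
  auto event =
      c10d::ProcessGroupNCCL::CUDAEventCache::get().create(enableTiming);

  // Record on the current CUDA stream, as a work object would before a collective.
  event->record();

  // When the last shared_ptr copy is released, the deleter pushes the raw
  // pointer back into eventsArray_, so the event is never destroyed on the
  // hot path (avoiding the ~CudaEvent hang mentioned above).
  return event;
}
```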
We added `CudaEventCache` in #133727, a feature that reuses CUDA events so that we no longer destroy CudaEvents, which has caused hangs in the past. It has already been exercised by a number of tests and by runs on TorchTitan and internal workloads; no errors or crashes have been found so far, so we have decided to roll it out to all OSS users. Internal workloads are not affected by this PR because of internal gating. cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o
…ce support" We also observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
…140975) This lands the rollout described above and brings back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files. Pull Request resolved: #140975 Approved by: https://github.com/eqy, https://github.com/kwen2501
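As a rough illustration of what per-device caching could look like (an assumption about the multi-device direction, not code from #122732 or this PR), one option is to bucket cached events by device index so events recorded on different devices are never mixed:

```cpp
// Hypothetical sketch of a per-device event cache; the class, members, and
// kMaxDevices bound are illustrative assumptions.
#include <array>
#include <memory>
#include <mutex>
#include <vector>

#include <ATen/cuda/CUDAEvent.h>
#include <cuda_runtime_api.h>

class PerDeviceCUDAEventCache {
 public:
  static PerDeviceCUDAEventCache& get() {
    static PerDeviceCUDAEventCache cache;
    return cache;
  }

  // Borrow an event for `deviceIdx` (assumed < kMaxDevices); the deleter
  // returns it to the same per-device bucket.
  std::shared_ptr<at::cuda::CUDAEvent> create(bool timing, int deviceIdx) {
    auto deleter = [this, timing, deviceIdx](at::cuda::CUDAEvent* event) {
      std::lock_guard<std::mutex> lock(mutex_);
      buckets_[deviceIdx][timing ? 1 : 0].push_back(event);
    };
    at::cuda::CUDAEvent* event = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto& events = buckets_[deviceIdx][timing ? 1 : 0];
      if (!events.empty()) {
        event = events.back();
        events.pop_back();
      }
    }
    if (!event) {
      event = new at::cuda::CUDAEvent(
          timing ? cudaEventDefault : cudaEventDisableTiming);
    }
    return std::shared_ptr<at::cuda::CUDAEvent>(event, std::move(deleter));
  }

 private:
  static constexpr int kMaxDevices = 16;  // assumed upper bound for the sketch
  std::mutex mutex_;
  // buckets_[device][0] = non-timing events, buckets_[device][1] = timing events.
  std::array<std::array<std::vector<at::cuda::CUDAEvent*>, 2>, kMaxDevices>
      buckets_;
};
```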
Stack from ghstack (oldest at bottom):
@zdevito added a cache for CudaEvent in #122732, and this PR productionizes it behind a rollout flag.
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o @xmfan
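A minimal sketch of how such a rollout flag could be read, assuming an environment variable controls it; the variable name TORCH_NCCL_CUDA_EVENT_CACHE, the default, and the helper shown here are assumptions, not necessarily the PR's exact implementation:

```cpp
// Hypothetical flag-gated rollout: read an env var once and use it to decide
// whether the event cache is enabled.
#include <cstdlib>
#include <string>

static bool cudaEventCacheEnabledFromEnv() {
  const char* val = std::getenv("TORCH_NCCL_CUDA_EVENT_CACHE");
  if (val == nullptr) {
    return false;  // assumed default: off until rolled out more broadly
  }
  const std::string s(val);
  return s == "1" || s == "true" || s == "TRUE" || s == "y" || s == "Y";
}

// A caller could then pick between the cache and a freshly allocated event:
//   if (cudaEventCacheEnabledFromEnv()) {
//     event = CUDAEventCache::get().create(enableTiming);
//   } else {
//     event = std::make_shared<at::cuda::CUDAEvent>(
//         enableTiming ? cudaEventDefault : cudaEventDisableTiming);
//   }
```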