Conversation

@zdevito zdevito commented Mar 26, 2024

Stack from ghstack (oldest at bottom):

Adding an event cache has two goals:
(1) lower the overhead of issuing collectives;
(2) remove cudaEventDestroy from the watchdog thread.
If CUDA gets stuck because of NCCL, cudaEventDestroy might hang. That has traditionally gotten the watchdog thread stuck, causing us to rely on a separate thread to make sure the watchdog thread makes progress. With this change, we probably do not need that thread anymore, but we can first check whether we still see any stack traces suggesting a heartbeat timeout after we land this change.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @rohan-varma
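
For illustration, here is a minimal sketch of the event-cache idea (hypothetical names, written against the plain CUDA runtime API rather than at::cuda::CUDAEvent; the actual implementation lives in ProcessGroupNCCL.cpp): events are handed out from per-kind free lists and returned to them by the shared_ptr deleter, so no thread ever calls cudaEventDestroy.

#include <cuda_runtime.h>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical, simplified event cache: "freed" events go back onto a free
// list instead of being destroyed, so no thread blocks in cudaEventDestroy.
class EventCache {
 public:
  // Hands out a cached event (or creates one on a miss). The shared_ptr's
  // deleter returns the event to the cache rather than destroying it.
  std::shared_ptr<cudaEvent_t> create(bool timing) {
    cudaEvent_t event = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto& bucket = events_[timing ? 1 : 0];
      if (!bucket.empty()) {
        event = bucket.back();
        bucket.pop_back();
      }
    }
    if (event == nullptr) {
      cudaEventCreateWithFlags(
          &event, timing ? cudaEventDefault : cudaEventDisableTiming);
    }
    auto recycle = [this, timing](cudaEvent_t* e) {
      std::lock_guard<std::mutex> lock(mutex_);
      events_[timing ? 1 : 0].push_back(*e);  // recycle, never cudaEventDestroy
      delete e;
    };
    return std::shared_ptr<cudaEvent_t>(new cudaEvent_t(event), recycle);
  }

 private:
  std::mutex mutex_;
  // Two free lists: [0] = events created without timing, [1] = with timing.
  std::vector<cudaEvent_t> events_[2];
};

A cache miss still pays for one cudaEventCreateWithFlags, but in steady state issuing a collective only pops and pushes a pointer under a mutex.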


pytorch-bot bot commented Mar 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122732

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c2b2115 with merge base 29132c2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Mar 26, 2024
zdevito added a commit that referenced this pull request Mar 26, 2024

ghstack-source-id: 1d05496
Pull Request resolved: #122732
@zdevito zdevito requested a review from shuqiangzhang March 26, 2024 22:07
@wconstab wconstab (Contributor) left a comment

This LGTM. Do you have some testing/instrumentation to confirm it actually works as intended?

return resultFuture;
}

class CUDAEventCache {
@shuqiangzhang shuqiangzhang (Contributor) commented Mar 26, 2024

By convention, can we move the class declaration to the .hpp file so that we can access the class anywhere in the cpp file?

@zdevito zdevito (author)

ProcessGroupNCCL.hpp is included in 5 compilation units, whereas code in ProcessGroupNCCL.cpp is only in one, so in cases where the code is only used locally I tend to avoid the header file to keep compile times down.

at::cuda::CUDAEvent* event = nullptr;
{
  std::lock_guard<std::mutex> lock(deviceCache.mutex);
  auto& events = deviceCache.events[timing ? 1 : 0];
@shuqiangzhang shuqiangzhang (Contributor) commented Mar 26, 2024

Maybe I am reading it wrong, but is deviceCache.events[timing ? 1 : 0] a vector of event*, or just a single event*?

@zdevito zdevito (author)

deviceCache.events is two vectors: a list of unused events without timing, and a list of unused events with timing. Since timing isn't strictly a global property (you can force timing on for a particular process group), we need to handle the situation where we fulfill both kinds of request.

@shuqiangzhang shuqiangzhang (Contributor) commented Mar 28, 2024

Oh, I got it now. Maybe it is just me, but the confusing part was std::vector<at::cuda::CUDAEvent*> events[2]: it uses a combination of vector and array semantics. Maybe a vector of vectors would be better.
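
For illustration only (a hypothetical refactor sketched from this discussion, not code from the PR, and assuming PyTorch's ATen header for at::cuda::CUDAEvent): the same two-bucket layout can keep its fixed size while naming the buckets, which avoids the raw events[timing ? 1 : 0] subscript that caused the confusion.

#include <ATen/cuda/CUDAEvent.h>
#include <array>
#include <vector>

// Hypothetical alternative layout for the two buckets discussed above:
// same data, but the fixed "exactly two kinds" invariant is made explicit
// with std::array and named indices instead of a raw C array subscript.
struct EventBuckets {
  enum Kind { kNoTiming = 0, kTiming = 1 };

  std::vector<at::cuda::CUDAEvent*>& bucket(bool timing) {
    return buckets_[timing ? kTiming : kNoTiming];
  }

 private:
  std::array<std::vector<at::cuda::CUDAEvent*>, 2> buckets_;
};

A vector of vectors would also work, but std::array keeps the "exactly two kinds" invariant in the type.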


class CUDAEventCache {
 public:
  CUDAEventCache() : caches_(at::cuda::device_count()) {}
Collaborator commented:

ProcessGroupNCCL now supports only a single device per thread. Would that help make the implementation here even simpler?

}
ncclEndEvent_ = std::make_shared<at::cuda::CUDAEvent>(
    enableTiming ? cudaEventDefault : cudaEventDisableTiming);
ncclEndEvent_ = CUDAEventCache::get().create(device.index(), enableTiming);
Collaborator commented:

Any idea how big a performance difference there will be between disableTiming and enableTiming?

@zdevito zdevito (author)

Here is one analysis I've seen:

https://github.com/harrism/cuda_event_benchmark

Using the items_per_second column, recording an event with timing takes about 2.5 us, whereas without timing it is about 0.25 us. Similarly, caching the events is roughly 10x cheaper than calling cudaEventCreate/cudaEventDestroy each time.
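
A rough way to reproduce that kind of comparison locally (a self-contained sketch against the CUDA runtime API; the numbers above come from the linked benchmark, not from this code):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Times cudaEventRecord on one stream with a given event-creation flag and
// returns the average cost per record in microseconds.
static double timeRecord(unsigned int flags, int iters) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, flags);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    cudaEventRecord(ev, stream);  // re-recording the same event is legal
  }
  cudaStreamSynchronize(stream);
  auto end = std::chrono::steady_clock::now();
  cudaStreamDestroy(stream);
  cudaEventDestroy(ev);
  return std::chrono::duration<double, std::micro>(end - start).count() / iters;
}

int main() {
  const int iters = 100000;
  std::printf("record with timing:    %.3f us\n",
              timeRecord(cudaEventDefault, iters));
  std::printf("record without timing: %.3f us\n",
              timeRecord(cudaEventDisableTiming, iters));
  return 0;
}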

@zdevito zdevito commented Mar 27, 2024

I tested this locally with prints to ensure I see events getting reused, but I am looking for ideas on how to test it more thoroughly.

@pritamdamania87 pritamdamania87 (Contributor) commented:

Quoting the PR description:

    (2) remove cudaEventDestroy from the watchdog thread. If CUDA gets stuck because of NCCL, cudaEventDestroy might hang. That has traditionally gotten the watchdog thread stuck, causing us to rely on a separate thread to make sure the watchdog thread makes progress. With this change, we probably do not need that thread anymore, but we can first check whether we still see any stack traces suggesting a heartbeat timeout after we land this change.

This isn't a reliable or general way to solve the problem mentioned in #101463. As mentioned in #101463, the watchdog thread also calls things like isCompleted(): https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1647 which eventually calls things like cudaGetLastError, which can get stuck too. There are probably many CUDA calls in the watchdog thread that are not easy to find and track down. That is why a separate minimalistic thread can avoid such situations more reliably.
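
For context, the separate minimalistic thread works roughly like this (a generic sketch, not the actual ProcessGroupNCCL heartbeat code): the watchdog bumps an atomic counter on every iteration, and a monitor thread that never touches CUDA checks that the counter keeps advancing.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<uint64_t> heartbeat{0};
std::atomic<bool> stopRequested{false};

// Watchdog loop: may block inside CUDA calls (event queries, error checks, ...).
void watchdogLoop() {
  while (!stopRequested.load()) {
    heartbeat.fetch_add(1);  // proof of progress
    // ... poll outstanding work, query events, check for NCCL errors ...
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}

// Monitor loop: makes no CUDA calls at all, so it cannot get stuck even when
// the watchdog is wedged inside the driver; it only compares counter values.
void monitorLoop(std::chrono::seconds timeout) {
  uint64_t last = heartbeat.load();
  while (!stopRequested.load()) {
    std::this_thread::sleep_for(timeout);
    uint64_t now = heartbeat.load();
    if (now == last) {
      std::fprintf(stderr, "watchdog made no progress for %lld s\n",
                   static_cast<long long>(timeout.count()));
      // real code would dump debug info and/or abort the process here
    }
    last = now;
  }
}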

@kwen2501 kwen2501 (Collaborator) left a comment

LGTM!


github-actions bot commented Jun 3, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 3, 2024
@github-actions github-actions bot closed this Jul 3, 2024
@github-actions github-actions bot deleted the gh/zdevito/260/head branch August 2, 2024 01:56
fduwjj added a commit that referenced this pull request Aug 19, 2024
zdevito added a cache for CUDAEvent in #122732, and we want to productionize it behind a flag in this PR.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]