Cache start/end events in nccl. #122732
Conversation
Adding an event cache has two goals:
(1) lower the overhead of issuing collectives, and
(2) remove cudaEventDestroy from the watchdog thread.

If CUDA gets stuck because of NCCL, then cudaEventDestroy might hang. That has traditionally wedged the watchdog thread, which is why we rely on a separate thread to make sure the watchdog keeps making progress. With this change we probably do not need that extra thread anymore, but we can first land this and check whether any stack traces still suggest a heartbeat timeout.
[ghstack-poisoned]
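For intuition, here is a minimal sketch of the caching pattern described above. The class name and the timing-indexed free lists mirror the PR, but the exact members and signatures are illustrative rather than the landed code: events are handed out from a mutex-protected free list, and the shared_ptr deleter returns them to the list instead of calling cudaEventDestroy.

```cpp
// Illustrative sketch of the event cache, not the exact landed code.
#include <memory>
#include <mutex>
#include <vector>
#include <ATen/cuda/CUDAEvent.h>

class CUDAEventCacheSketch {
 public:
  // Hand out a cached event when one is available, otherwise allocate a new
  // one. The shared_ptr's deleter recycles the event back into the cache, so
  // the watchdog thread never ends up inside cudaEventDestroy.
  std::shared_ptr<at::cuda::CUDAEvent> create(bool timing) {
    at::cuda::CUDAEvent* event = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto& events = events_[timing ? 1 : 0];
      if (!events.empty()) {
        event = events.back();
        events.pop_back();
      }
    }
    if (event == nullptr) {
      event = new at::cuda::CUDAEvent(
          timing ? cudaEventDefault : cudaEventDisableTiming);
    }
    return std::shared_ptr<at::cuda::CUDAEvent>(
        event, [this, timing](at::cuda::CUDAEvent* e) {
          std::lock_guard<std::mutex> lock(mutex_);
          events_[timing ? 1 : 0].push_back(e);  // recycle instead of destroy
        });
  }

 private:
  std::mutex mutex_;
  // Two free lists: index 0 holds events created without timing,
  // index 1 holds events created with timing enabled.
  std::vector<at::cuda::CUDAEvent*> events_[2];
};
```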
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122732
✅ No failures as of commit c2b2115 with merge base 29132c2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 1d05496
Pull Request resolved: #122732
wconstab left a comment:
This LGTM. Do you have some testing/instrumentation to confirm it actually works as intended?
```cpp
  return resultFuture;
}

class CUDAEventCache {
```
By convention, can we move the class declaration to the .hpp file so that we can access the class anywhere in the cpp file?
ProcessGroupNCCL.hpp is included in 5 compilation units, whereas code in ProcessGroupNCCL.cpp is in only one, so when code is only used locally I tend to keep it out of the header to keep compile times down.
```cpp
at::cuda::CUDAEvent* event = nullptr;
{
  std::lock_guard<std::mutex> lock(deviceCache.mutex);
  auto& events = deviceCache.events[timing ? 1 : 0];
```
Maybe I am reading it wrong: is deviceCache.events[timing ? 1 : 0] a vector of event*, or just a single event*?
deviceCache.events is two vectors: a list of unused events without timing and a list of unused events with timing. Since timing isn't strictly a global property (you can force timing on for a particular process group), we need to handle requests for both kinds.
Oh, I get it now. Maybe it is just me, but the confusing part was std::vector<at::cuda::CUDAEvent*> events[2]: it mixes vector and array semantics. Maybe a vector of vectors would be clearer.
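To make the two layouts being discussed concrete, here are illustrative declarations (not the landed code):

```cpp
#include <vector>
#include <ATen/cuda/CUDAEvent.h>

// What the PR uses: a C array of two vectors, indexed by the timing flag
// (index 0 = events created without timing, index 1 = with timing).
std::vector<at::cuda::CUDAEvent*> events[2];

// The suggested alternative: a vector of vectors sized to 2 up front,
// which avoids mixing array and vector semantics.
std::vector<std::vector<at::cuda::CUDAEvent*>> eventsAlt(2);

// Both layouts are indexed the same way: pick the free list that matches
// the requested timing mode.
std::vector<at::cuda::CUDAEvent*>& poolFor(bool timing) {
  return events[timing ? 1 : 0];
}
```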
```cpp
class CUDAEventCache {
 public:
  CUDAEventCache() : caches_(at::cuda::device_count()) {}
```
ProcessGroupNCCL now supports only a single device per thread. Would that help make the implementation here even simpler?
```diff
  }
- ncclEndEvent_ = std::make_shared<at::cuda::CUDAEvent>(
-     enableTiming ? cudaEventDefault : cudaEventDisableTiming);
+ ncclEndEvent_ = CUDAEventCache::get().create(device.index(), enableTiming);
```
Any idea how big a performance difference there will be between disableTiming and enableTiming?
Here is one analysis I've seen: using the items_per_second column, recording an event with timing takes about 2.5 us, while without timing it is about 0.25 us. Similarly, caching the events is roughly 10x faster than calling cudaEventCreate/cudaEventDestroy each time.
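For anyone who wants a rough reproduction of that kind of comparison, here is a standalone CUDA sketch (not the benchmark quoted above; the function names and iteration count are arbitrary, and absolute numbers will vary by GPU and driver):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Average cost of recording one reusable event on the default stream.
static double timeRecord(unsigned int flags, int iters) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, flags);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    cudaEventRecord(ev, 0);  // reuse the same event, like the cache does
  }
  cudaDeviceSynchronize();
  auto end = std::chrono::steady_clock::now();
  cudaEventDestroy(ev);
  return std::chrono::duration<double, std::micro>(end - start).count() / iters;
}

// Average cost when each iteration creates, records, and destroys an event,
// i.e. the uncached path.
static double timeCreateDestroy(unsigned int flags, int iters) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, flags);
    cudaEventRecord(ev, 0);
    cudaEventDestroy(ev);
  }
  cudaDeviceSynchronize();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(end - start).count() / iters;
}

int main() {
  const int iters = 100000;
  std::printf("record, timing on:     %.3f us\n", timeRecord(cudaEventDefault, iters));
  std::printf("record, timing off:    %.3f us\n", timeRecord(cudaEventDisableTiming, iters));
  std::printf("create+record+destroy: %.3f us\n", timeCreateDestroy(cudaEventDisableTiming, iters));
  return 0;
}
```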
I tested this locally with prints to ensure I see events getting reused, but I am looking for ideas on how to actually test it better.
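One low-tech way to go beyond prints is a reuse check like the following, written against the sketch cache shown earlier in this thread rather than the actual ProcessGroupNCCL internals, so it only illustrates the idea: release an event and verify that the next create hands back the same pointer.

```cpp
#include <cassert>

// Hypothetical reuse check against the CUDAEventCacheSketch shown above.
void checkEventReuse() {
  CUDAEventCacheSketch cache;
  at::cuda::CUDAEvent* first = nullptr;
  {
    auto ev = cache.create(/*timing=*/false);
    first = ev.get();
  }  // the deleter returns the event to the cache here
  auto ev2 = cache.create(/*timing=*/false);
  assert(ev2.get() == first);  // the cached event was reused, not recreated
}
```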
This isn't a reliable or general way to solve the problem mentioned in #101463. As mentioned in #101463, the watchdog thread also calls things like …
kwen2501 left a comment:
LGTM!
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
@zdevito added a cache for CudaEvent in #122732, and we want to productionize it with a flag in this PR. Pull Request resolved: #133727. Approved by: https://github.com/shuqiangzhang, https://github.com/eqy