-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[c10d] separate the codes for GPU stream synchronization and CPU thread synchronization #137295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137295
Note: Links to docs will display an error until the docs builds have been completed. ❌ 3 Cancelled JobsAs of commit fd74b29 with merge base 38114ec ( CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: f089b63 Pull Request resolved: #137295
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please commit the suggested changes from pytorch's linter.
…CPU blocking" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…CPU blocking" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 34f7609 Pull Request resolved: #137295
…nd CPU thread synchronization" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…nd CPU thread synchronization" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: a80cb61 Pull Request resolved: #137295
…nd CPU thread synchronization" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 669bb0f Pull Request resolved: #137295
kwen2501
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Giving a tentative approval.
Can you:
- address the placement of
synchronizeStream(see comment) - check if there is any customer of
work.synchronize()(see comment)
?
| // syncrhoize() will block the current stream on the NCCL stream | ||
| synchronize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this change seems to move synchronizeStream after blocking wait.
In 99% of cases, this should be a no-change to code behavior, because the main thread is being hijacked by blocking wait.
I just worry about the 1% chance where another thread my have a handle on the compute stream too -- in case that, it would be nice to call synchronizeStream as early as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, the comments lie about the 'no behavior change part'. The behavior change here is only when blockingwait mode is true or timeout is specified, where user is more concerned about blocking the CPU thread till the collective is done. Will monitor if any users is affected by this change, Later we might further remove GPU stream synchronize() for blockingwait mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am actually not concerned about behavior change, as you mentioned, it only relates to blocking wait mode.
My comment was more on robustness, as in, would it be generally better to establish stream dependency asap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually put synchronize() before the cpu wait code in an early version. But it breaks the current barrier behavior which requires the CPU loop happens before the stream sync (https://www.internalfb.com/code/fbsource/[4713fa33ea08c1d66083e1bbe6c1fa4af785b424]/fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp?lines=695). which I am not sure it is safe to change. I am planning to remove the barrier blocks in the synchronize() method in future PRs and then we can switch the order
While that is true, it also introduced a new behavior: Can you document this behavior? And in terms of UX, should user call it like this? |
|
|
||
| private: | ||
| // Helper function for synchronize | ||
| void synchronizeInternal(std::chrono::milliseconds timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this function name is once very confusing to folks.
…nd CPU thread synchronization" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…nd CPU thread synchronization" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…nd CPU thread synchronization" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…nd CPU thread synchronization" code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
code Summary: This PR should not change the existing behavior of work.wait(), just separate the stream synchronization code from the CPU busy wait code. Also, remove the need of a private synchronization function. Test Plan: Existing UT passed Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: c8a7049 Pull Request resolved: #137295
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Stack from ghstack (oldest at bottom):
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.
Also, remove the need of a private synchronization function.
In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.
Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o