Skip to content

Conversation

@shuqiangzhang
Copy link
Contributor

@shuqiangzhang shuqiangzhang commented Oct 3, 2024

Stack from ghstack (oldest at bottom):

code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137295

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Cancelled Jobs

As of commit fd74b29 with merge base 38114ec (image):

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Oct 3, 2024
shuqiangzhang added a commit that referenced this pull request Oct 3, 2024
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: f089b63
Pull Request resolved: #137295
@shuqiangzhang shuqiangzhang marked this pull request as draft October 3, 2024 19:06
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please commit the suggested changes from pytorch's linter.

…CPU blocking"

code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
…CPU blocking"

code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
shuqiangzhang added a commit that referenced this pull request Oct 3, 2024
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 34f7609
Pull Request resolved: #137295
@shuqiangzhang shuqiangzhang self-assigned this Oct 3, 2024
@shuqiangzhang shuqiangzhang marked this pull request as ready for review October 3, 2024 22:01
@shuqiangzhang shuqiangzhang changed the title [c10d] separte the concepts of stream synchronization and CPU blocking [c10d] separate the codes for GPU stream synchronization and CPU thread synchronization Oct 3, 2024
…nd CPU thread synchronization"


code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@shuqiangzhang shuqiangzhang requested a review from c-p-i-o October 7, 2024 17:35
…nd CPU thread synchronization"


code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
shuqiangzhang added a commit that referenced this pull request Oct 7, 2024
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: a80cb61
Pull Request resolved: #137295
…nd CPU thread synchronization"


code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
shuqiangzhang added a commit that referenced this pull request Oct 7, 2024
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 669bb0f
Pull Request resolved: #137295
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving a tentative approval.
Can you:

  • address the placement of synchronizeStream (see comment)
  • check if there is any customer of work.synchronize() (see comment)
    ?

Comment on lines +725 to +726
// syncrhoize() will block the current stream on the NCCL stream
synchronize();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this change seems to move synchronizeStream after blocking wait.
In 99% of cases, this should be a no-change to code behavior, because the main thread is being hijacked by blocking wait.
I just worry about the 1% chance where another thread my have a handle on the compute stream too -- in case that, it would be nice to call synchronizeStream as early as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the comments lie about the 'no behavior change part'. The behavior change here is only when blockingwait mode is true or timeout is specified, where user is more concerned about blocking the CPU thread till the collective is done. Will monitor if any users is affected by this change, Later we might further remove GPU stream synchronize() for blockingwait mode.

Copy link
Collaborator

@kwen2501 kwen2501 Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am actually not concerned about behavior change, as you mentioned, it only relates to blocking wait mode.
My comment was more on robustness, as in, would it be generally better to establish stream dependency asap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually put synchronize() before the cpu wait code in an early version. But it breaks the current barrier behavior which requires the CPU loop happens before the stream sync (https://www.internalfb.com/code/fbsource/[4713fa33ea08c1d66083e1bbe6c1fa4af785b424]/fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp?lines=695). which I am not sure it is safe to change. I am planning to remove the barrier blocks in the synchronize() method in future PRs and then we can switch the order

@kwen2501
Copy link
Collaborator

kwen2501 commented Oct 7, 2024

This PR should not change the existing behavior of work.wait()

While that is true, it also introduced a new behavior:
work.wait(timeout)

Can you document this behavior?

And in terms of UX, should user call it like this?

try:
  work.wait(timeout)
except:
  # some handling


private:
// Helper function for synchronize
void synchronizeInternal(std::chrono::milliseconds timeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this function name is once very confusing to folks.

…nd CPU thread synchronization"


code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
…nd CPU thread synchronization"


code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
…nd CPU thread synchronization"


code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
…nd CPU thread synchronization"


code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
shuqiangzhang added a commit that referenced this pull request Oct 7, 2024
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.
Test Plan:
Existing UT passed
Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: c8a7049
Pull Request resolved: #137295
@shuqiangzhang
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 8, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions github-actions bot deleted the gh/shuqiangzhang/45/head branch November 8, 2024 02:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants