
Conversation

@shuqiangzhang
Contributor

@shuqiangzhang shuqiangzhang commented Jan 9, 2025

Stack from ghstack (oldest at bottom):

Summary:
This PR is essentially a replacement for #140087, which caused a performance drop due to the frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check into the monitoring thread.

If the PG is unhealthy, the user should be able to get the type of error, e.g., a timeout, an NCCL error, or a remote error.

This API operates at the PG level, in contrast to the work.get_future_result() API, which operates at the Work level. Error detection at the PG level is much more convenient when the user wants to handle a PG failure as a whole, e.g., by restarting the PG.
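For illustration, a minimal sketch of how a caller might act on a PG-level error code; query_pg_error and restart_pg are placeholder callables standing in for the new accessor and the caller's own recovery logic, and the string values only mirror the error kinds listed above:

```python
from typing import Callable

# Sketch only: react to a PG-level error code. `query_pg_error` stands in for
# the PG-level accessor (exact name omitted here), and `restart_pg` is the
# caller's own recovery hook.
def handle_pg_error(query_pg_error: Callable[[], str],
                    restart_pg: Callable[[], None]) -> None:
    err = query_pg_error()
    if err == "SUCCESS":
        return  # PG is healthy, nothing to do
    if err in ("TIMEOUT", "REMOTE_ERROR"):
        restart_pg()  # handle the failure at the PG level, e.g. tear down and re-init
    else:  # e.g. a local NCCL error
        raise RuntimeError(f"process group hit a non-recoverable error: {err}")
```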

Error handling at the work level is still useful for attaching work-specific context and debugging the root cause of the specific failing work/collective.
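For comparison, a sketch of attaching work-specific context through the Work-level API mentioned above; it assumes an already-initialized PG, and the callback body and the shape of the future's value are illustrative only:

```python
import torch
import torch.distributed as dist

def all_reduce_with_context(tensor: torch.Tensor, step: int):
    # Work-level handling: we know exactly which collective (and which training
    # step) a result belongs to, which helps when debugging the root cause.
    work = dist.all_reduce(tensor, async_op=True)
    fut = work.get_future_result()
    fut.then(lambda f: print(f"all_reduce at step {step} finished: {f.value()}"))
    return work
```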

Note that it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (the one that detects a local error) to all other ranks in the PG. The broadcast is currently done through TCPStore.
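The broadcast idea can be pictured with a small TCPStore sketch; the key name, poll interval, and helper functions below are illustrative only, not the internals of ProcessGroupNCCL:

```python
from datetime import timedelta
from typing import Optional
import torch.distributed as dist

ERROR_KEY = "remote_error"  # illustrative key name

def report_local_error(store: dist.Store, error_code: int) -> None:
    # src rank: publish the local failure so peers can surface it as REMOTE_ERROR
    store.set(ERROR_KEY, str(error_code))

def poll_remote_error(store: dist.Store) -> Optional[int]:
    # other ranks (e.g. from a monitoring thread): poll with a short timeout
    try:
        store.wait([ERROR_KEY], timedelta(milliseconds=100))
    except Exception:
        return None  # key not set yet; no remote error observed
    return int(store.get(ERROR_KEY))
```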

Tags:

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jan 9, 2025
@pytorch-bot

pytorch-bot bot commented Jan 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144498

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ca0a0f0 with merge base 015c6d6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

auto currentTime = std::chrono::steady_clock::now();

// Check and set remote error if it has not been set before
checkAndSetRemoteError();
Contributor Author

This is moved from watchdog thread to monitor thread, compared to #140087.

@shuqiangzhang shuqiangzhang changed the title <Replace this line with a title. Use 1 line only, 67 chars or less> [PGNCCL] Add an API to get the status/error code of each PG Jan 9, 2025
@shuqiangzhang shuqiangzhang changed the title [PGNCCL] Add an API to get the status/error code of each PG [PGNCCL] Add an API to get the status/error code at the PG level Jan 9, 2025
@shuqiangzhang shuqiangzhang requested a review from d4l3k January 9, 2025 21:40
shuqiangzhang added a commit that referenced this pull request Jan 9, 2025
ghstack-source-id: fb4ea62
Pull Request resolved: #144498
@shuqiangzhang shuqiangzhang marked this pull request as draft January 13, 2025 21:46
@shuqiangzhang
Contributor Author

Realized that the propagated error signal should also be at the PG level, so that a failure in one PG does not automatically propagate to other PGs.
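Concretely, the store key can be namespaced per PG so the signal stays scoped; the naming scheme in this sketch is illustrative only:

```python
def pg_error_key(pg_name: str) -> str:
    # One key per process group, so an error broadcast for one PG is not
    # picked up by the monitoring threads of other PGs.
    return f"{pg_name}/remote_error"
```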

shuqiangzhang added a commit that referenced this pull request Jan 22, 2025
ghstack-source-id: aca3df0
Pull Request resolved: #144498
@shuqiangzhang shuqiangzhang marked this pull request as ready for review January 22, 2025 01:48
@shuqiangzhang
Contributor Author

Updated the PR with: 1) a per-PG error signal instead of a 'global' error signal; 2) an environment variable to control whether the TCPStore-based broadcast of the signal is enabled, for a more gradual rollout of the feature.
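A sketch of that rollout toggle; the variable name below is a placeholder, not the flag the PR actually reads:

```python
import os

def tcpstore_error_broadcast_enabled() -> bool:
    # Gradual rollout: only broadcast per-PG errors through TCPStore when the
    # (placeholder-named) flag is explicitly turned on.
    return os.environ.get("TORCH_PG_ERROR_BROADCAST_EXAMPLE", "0") == "1"
```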

Collaborator

@kwen2501 kwen2501 left a comment

Useful API! I just have some minor comments.

constexpr const char* NCCL_BACKEND_NAME = "nccl";

constexpr const char* EXCEPTION_DUMP = "exception_dump";
constexpr const char* kStoreDumpKey = "exception_dump";
Collaborator

Not related to this PR, but can exception_dump be just dump?

@shuqiangzhang
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 23, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed, first few of them are: Check mergeability of ghstack PR / ghstack-mergeability-check


@pytorch-bot pytorch-bot bot had a problem deploying to upload-benchmark-results January 23, 2025 23:06 Error
@pytorch-bot pytorch-bot bot had a problem deploying to upload-benchmark-results January 23, 2025 23:51 Error
shuqiangzhang added a commit that referenced this pull request Jan 24, 2025
ghstack-source-id: 3f945c9
Pull Request resolved: #144498
@shuqiangzhang
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorch-bot pytorch-bot bot temporarily deployed to upload-benchmark-results January 24, 2025 00:35 Inactive
@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@shuqiangzhang
Contributor Author

@pytorchbot merge -f "merge timed out in the last attempt"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-actions github-actions bot deleted the gh/shuqiangzhang/63/head branch February 24, 2025 02:06