[PGNCCL] Add an API to get the status/error code of each PG #140087

shuqiangzhang · 2024-11-08T02:02:02Z

Stack from ghstack (oldest at bottom):

-> [PGNCCL] Add an API to get the status/error code of each PG #140087

Summary:
If a PG is unhealthy, the user should be able to get the type of error, e.g.,
timeout, NCCL error, or remote error.

This API applies at the PG level, in contrast to the work.get_future_result() API, which applies at the Work level.
Error detection at the PG level makes it much more convenient for users to handle a PG failure as a whole, e.g., by restarting the PG.

Error handling at the Work level is still useful for attaching work-specific context and debugging the root cause (RC) of the specific failing work/collective.

Note that it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (the rank that detects a local error) to all other ranks in the PG. The broadcast is currently done through TCPStore.
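The flow described in this summary can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the code in this PR: `InMemoryStore` and `FakeProcessGroup` are hypothetical stand-ins (the real implementation uses TCPStore and ProcessGroupNCCL), and only the enum values and the `"remote_error"` key are taken from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Error codes as they appear in the diff under review.
enum class ErrorType { NO_ERROR = 0, TIMEOUT = 1, NCCL_ERROR = 2, REMOTE_ERROR = 3 };

// Hypothetical in-memory stand-in for TCPStore (not the c10d::Store API).
class InMemoryStore {
 public:
  void set(const std::string& key, const std::vector<uint8_t>& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    data_[key] = value;
  }
  bool check(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    return data_.count(key) > 0;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::vector<uint8_t>> data_;
};

// Hypothetical per-rank process group that tracks a single PG-level error.
class FakeProcessGroup {
 public:
  FakeProcessGroup(InMemoryStore& store, int rank) : store_(store), rank_(rank) {}

  // A rank that detects a local failure records it and publishes its own
  // rank ID under the shared key so that other ranks can observe it.
  void reportLocalError(ErrorType type) {
    {
      std::lock_guard<std::mutex> lock(errorMutex_);
      if (error_ == ErrorType::NO_ERROR) {
        error_ = type;  // keep only the first error
      }
    }
    std::vector<uint8_t> payload(sizeof(int));
    std::memcpy(payload.data(), &rank_, sizeof(int));
    store_.set("remote_error", payload);
  }

  // Other ranks call this periodically and adopt REMOTE_ERROR if the key
  // exists and they have not recorded an error of their own.
  void pollRemoteError() {
    if (!store_.check("remote_error")) {
      return;
    }
    std::lock_guard<std::mutex> lock(errorMutex_);
    if (error_ == ErrorType::NO_ERROR) {
      error_ = ErrorType::REMOTE_ERROR;
    }
  }

  // PG-level query: one error code for the whole group on this rank.
  ErrorType getError() {
    std::lock_guard<std::mutex> lock(errorMutex_);
    return error_;
  }

 private:
  InMemoryStore& store_;
  int rank_;
  std::mutex errorMutex_;
  ErrorType error_ = ErrorType::NO_ERROR;
};
```

In this sketch the detecting rank keeps its specific local error (e.g., TIMEOUT), while every other rank only learns that some remote rank failed, which mirrors the TODO in the diff about whether remote timeouts and remote NCCL errors need to be distinguished.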

Tags:

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot · 2024-11-08T02:02:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140087

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[DomainsOnly] Jobs fail with GLIBC version not found

✅ No Failures

As of commit 1a1049d with merge base e474f0d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 · 2024-11-09T02:33:04Z

torch/_C/_distributed_c10d.pyi

+class ErrorType(Enum):
+    NO_ERROR = ...
+    TIMEOUT = ...
+    NCCL_ERROR = ...


Generalize NCCL_ERROR to BACKEND_ERROR?

Making it COMM_ERROR, similar to the workResult errors.

kwen2501 · 2024-11-09T02:38:26Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

+enum TORCH_API ErrorType {
+  NO_ERROR = 0,
+  TIMEOUT = 1,
+  NCCL_ERROR = 2,
+  // TODO, do we need to distinguish between remote timeout or remote NCCL
+  // errors?
+  REMOTE_ERROR = 3,
+};
+


I guess we can define this API one level up in Backend.hpp?

kwen2501 · 2024-11-09T02:39:46Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

+  ErrorType getError();
+


Same suggestion for uplifting. For non-NCCL backends, we can leave it as unimplemented.

I was debating this myself too; it can be made available in any backend.
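One way the uplift suggested above could look is sketched below. This is a hedged illustration, not the actual Backend.hpp: `BackendSketch`, `NcclLikeBackend`, `OtherBackend`, and `supportsGetError` are hypothetical names, and `COMM_ERROR` reflects the rename proposed earlier in this review rather than a confirmed final name.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Enum values follow the diff; COMM_ERROR is the rename proposed in review.
enum class ErrorType { NO_ERROR = 0, TIMEOUT = 1, COMM_ERROR = 2, REMOTE_ERROR = 3 };

// Hypothetical base class standing in for a generic backend interface.
class BackendSketch {
 public:
  virtual ~BackendSketch() = default;
  virtual std::string name() const = 0;

  // Default behavior: backends that do not track a PG-level error
  // report the query as unimplemented.
  virtual ErrorType getError() {
    throw std::runtime_error("getError() is not implemented for backend " + name());
  }
};

// A backend that tracks its own error state overrides the default.
class NcclLikeBackend : public BackendSketch {
 public:
  std::string name() const override { return "nccl-like"; }
  ErrorType getError() override { return error_; }
  void setError(ErrorType e) { error_ = e; }

 private:
  ErrorType error_ = ErrorType::NO_ERROR;
};

// A backend that leaves getError() unimplemented.
class OtherBackend : public BackendSketch {
 public:
  std::string name() const override { return "other"; }
};

// Returns true if the backend answers the PG-level error query.
bool supportsGetError(BackendSketch& backend) {
  try {
    backend.getError();
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Callers that hold only a base-class reference can then query any backend uniformly and treat the "unimplemented" exception as "no PG-level error reporting available".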

kwen2501

Overall looks good to me.

kwen2501 · 2024-11-09T04:54:30Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+void ProcessGroupNCCL::broadcastSignal(
+    c10::intrusive_ptr<Store>& store,
+    const std::string& signal,
+    int srcRank) {


From the signature, it looks like we can make this a util function independent of ProcessGroupNCCL? Not a hard requirement though.

I will keep it in PGNCCL for now; we can move it later if other backends need it.

kwen2501 · 2024-11-09T04:55:29Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+    try {
+      auto vec = store->get(std::string(signal));
+      TORCH_CHECK_WITH(
+          DistBackendError,
+          vec.size() == sizeof(int),
+          "Invalid size for the timeout rank ID");
+      std::memcpy(&srcRank, vec.data(), vec.size());


Care to give a comment for this block?

kwen2501 · 2024-11-09T04:59:37Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+    // Check and set remote error if it has not been set before
+    if (getError() == ErrorType::NO_ERROR) {
+      int remoteErrorRank =
+          getSignalSrcRank(store_, std::string(REMOTE_ERROR_SIGNAL));
+      if (remoteErrorRank != -1) {
+        std::lock_guard<std::mutex> lock(errorMutex_);
+        error_ = ErrorType::REMOTE_ERROR;
+        LOG(ERROR) << c10::str(
+            logPrefix(),
+            " remote error detected by watchdog thread from rank: ",
+            remoteErrorRank);
+      }
+    }
+


nit: modularize this as checkRemoteError?

kwen2501 · 2024-11-09T05:03:35Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+      if (work.exception() && getError() == ErrorType::NO_ERROR) {
+        // set the error to the first error found
+        std::lock_guard<std::mutex> lock(errorMutex_);
+        error_ = ErrorType::NCCL_ERROR;
+      }


nit: lock -> read -> write to guarantee atomicity.
Or, we can use std::atomic.
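The concern here is that the snippet reads the error outside the lock and then writes it under the lock, so two threads could both see NO_ERROR and the later writer would overwrite the first error. A minimal sketch of the std::atomic alternative is shown below; `FirstErrorLatch` and `trySet` are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>

// Error codes as they appear in the diff under review.
enum class ErrorType : int { NO_ERROR = 0, TIMEOUT = 1, NCCL_ERROR = 2, REMOTE_ERROR = 3 };

// Hypothetical helper that keeps only the first error ever reported.
class FirstErrorLatch {
 public:
  // Atomically sets the error to `candidate` only if it is still NO_ERROR.
  // Returns true if this call recorded the error, false if one was already set.
  bool trySet(ErrorType candidate) {
    ErrorType expected = ErrorType::NO_ERROR;
    return error_.compare_exchange_strong(expected, candidate);
  }

  ErrorType get() const { return error_.load(); }

 private:
  std::atomic<ErrorType> error_{ErrorType::NO_ERROR};
};
```

With compare-and-swap the check and the write happen as one step, so the "first error wins" rule holds without a mutex. The lock-based equivalent is to take the lock first and perform both the read and the write while holding it.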

kwen2501 · 2024-11-09T05:06:36Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

 constexpr const char* EXCEPTION_DUMP = "exception_dump";

+constexpr const char* REMOTE_ERROR_SIGNAL = "remote_error";


nit: storeDumpKey, storeErrorSignalKey

kwen2501 · 2024-11-09T05:09:14Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+      int remoteErrorRank =
+          getSignalSrcRank(store_, std::string(REMOTE_ERROR_SIGNAL));


I haven't checked whether getSignalSrcRank could be blocking or not. Let's be careful when putting a potentially blocking call in the watchdog.

It's non-blocking in a sense: we first check whether the key exists, which is non-blocking, and then read the value only if the key exists. If we instead tried to get the key directly, the call could block.
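The check-then-get pattern described in this reply can be sketched as follows. `MockStore`, `encodeRank`, and `getSignalSrcRankSketch` are hypothetical names for illustration; the mock only mimics the shape of a key-value store and is not the c10d::Store API. The -1 sentinel matches the `remoteErrorRank != -1` test in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a key-value store.
struct MockStore {
  std::unordered_map<std::string, std::vector<uint8_t>> data;

  // Cheap existence test that never waits for the key to appear.
  bool check(const std::string& key) const { return data.count(key) > 0; }

  // A real store's get() may wait until the key is set; this mock throws
  // std::out_of_range instead, so callers must check() first.
  const std::vector<uint8_t>& get(const std::string& key) const { return data.at(key); }

  void set(const std::string& key, std::vector<uint8_t> value) {
    data[key] = std::move(value);
  }
};

// Serializes a rank ID into raw bytes, as the broadcasting side would.
std::vector<uint8_t> encodeRank(int rank) {
  std::vector<uint8_t> bytes(sizeof(int));
  std::memcpy(bytes.data(), &rank, sizeof(int));
  return bytes;
}

// Returns the rank that published `signal`, or -1 if no rank has done so.
int getSignalSrcRankSketch(const MockStore& store, const std::string& signal) {
  // Step 1: existence check only; return immediately if nothing was published.
  if (!store.check(signal)) {
    return -1;
  }
  // Step 2: the key exists, so reading it does not have to wait.
  const std::vector<uint8_t>& vec = store.get(signal);
  if (vec.size() != sizeof(int)) {
    throw std::runtime_error("Invalid size for the signal source rank ID");
  }
  // Step 3: decode the raw bytes back into the source rank.
  int srcRank = -1;
  std::memcpy(&srcRank, vec.data(), sizeof(int));
  return srcRank;
}
```

Even with this pattern, each call still costs a round trip to the store when the store is remote, so how often it runs matters as much as whether it blocks.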

kwen2501 · 2024-11-09T05:09:39Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+        // broadcast remote error signal to all other ranks in this specific PG.
+        broadcastSignal(store_, std::string(REMOTE_ERROR_SIGNAL), rank_);
+


I haven't checked whether broadcastSignal could be blocking or not. Let's be careful when putting a potentially blocking call in the watchdog.

The same functionality was already in the watchdog thread in the original code.

…140087) Pull Request resolved: pytorch#140087 Approved by: https://github.com/kwen2501

…ytorch#140087)" This reverts commit 80aa19a. Reverted pytorch#140087 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#140087 (comment)))

shuqiangzhang · 2025-01-09T19:03:56Z

This PR caused a perf drop in some tests, most likely because of the TCPStore read in the watchdog thread. Moving the read to another thread, e.g., the monitor thread, which already does TCPStore reads, could avoid the perf drop. I will do this in a fresh PR, since this PR is old and would need conflict resolution.
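The idea of moving the store read off the watchdog's hot path can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: `RemoteErrorMonitor` is a hypothetical class, the `probe` callback stands in for the store check, and the polling interval is arbitrary. It does not reproduce the actual monitor thread in ProcessGroupNCCL.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical monitor that polls for a remote error on its own thread,
// so that a latency-sensitive thread only needs to read an atomic flag.
class RemoteErrorMonitor {
 public:
  RemoteErrorMonitor(std::function<bool()> probe, std::chrono::milliseconds interval)
      : probe_(std::move(probe)), interval_(interval), thread_([this] { run(); }) {}

  ~RemoteErrorMonitor() { stop(); }

  // Asks the polling thread to exit and waits for it to finish.
  void stop() {
    stop_.store(true);
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  // Cheap query for other threads: no store access, just an atomic load.
  bool remoteErrorSeen() const { return seen_.load(); }

 private:
  void run() {
    while (!stop_.load()) {
      if (probe_()) {  // the potentially slow store check happens here
        seen_.store(true);
        return;
      }
      std::this_thread::sleep_for(interval_);
    }
  }

  std::function<bool()> probe_;
  std::chrono::milliseconds interval_;
  std::atomic<bool> stop_{false};
  std::atomic<bool> seen_{false};
  std::thread thread_;  // declared last so all other members exist before it starts
};
```

The trade-off in this design is detection latency: a remote error is noticed at most one polling interval after it is published, in exchange for keeping store traffic away from the thread that tracks collectives.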

Summary: This PR is basically a replacement of #140087, which caused a perf drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check into the monitoring thread.

…4498) Pull Request resolved: #144498 Approved by: https://github.com/kwen2501

[PGNCCL] Add an API to check the healthiness of each PG

eabd106

Summary: If a PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error, or remote error.

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Nov 8, 2024

shuqiangzhang marked this pull request as draft November 8, 2024 02:02

shuqiangzhang changed the title ~~[PGNCCL] Add an API to check the healthiness of each PG~~ [PGNCCL] Add an API to get the status/error code of each PG Nov 8, 2024

shuqiangzhang requested review from c-p-i-o, d4l3k, fduwjj, kwen2501 and wconstab and removed request for kwen2501 November 8, 2024 22:46

shuqiangzhang marked this pull request as ready for review November 8, 2024 22:47

kwen2501 reviewed Nov 9, 2024

shuqiangzhang closed this Jan 9, 2025

shuqiangzhang mentioned this pull request Jan 9, 2025

[PGNCCL] Add an API to get the status/error code at the PG level #144498

Closed

github-actions bot deleted the gh/shuqiangzhang/62/head branch February 9, 2025 02:09

5 participants