Commit 03d792c

Update on "[PGNCCL] Ensure comm is ready before all accesses"
Previously we only waited for the comm to become ready after its initialization, but that is not enough: other NCCL APIs can also put the comm into the InProgress state. Therefore, we now ensure the comm is ready every time `getNcclComm` is called, to protect subsequent NCCL calls on the returned comm.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
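The idea in the message above (check readiness on every access, not only after initialization) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `ProcessGroupNCCL`: `CommState`, `MockComm`, `CommHolder`, and `getComm` are hypothetical names, and the mock merely imitates a communicator that reports an in-progress state for a few polls.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>

// Hypothetical stand-in for a communicator's async state; not a real NCCL type.
enum class CommState { Success, InProgress, Error };

// Hypothetical mock of a nonblocking communicator: it reports InProgress
// for a fixed number of polls, then Success.
struct MockComm {
  explicit MockComm(int polls) : polls_until_ready(polls) {}

  CommState poll() {
    if (polls_until_ready > 0) {
      --polls_until_ready;
      return CommState::InProgress;
    }
    return CommState::Success;
  }

  // Models any later API call that puts the comm back into InProgress.
  void startAsyncOp(int polls) { polls_until_ready = polls; }

  int polls_until_ready;
};

// Hypothetical holder: every access waits for readiness before returning
// the comm, so callers never receive a comm that is still InProgress.
class CommHolder {
 public:
  explicit CommHolder(MockComm comm) : comm_(comm) {}

  MockComm& getComm(
      std::chrono::milliseconds timeout = std::chrono::milliseconds(1000)) {
    waitReady(timeout);
    return comm_;
  }

 private:
  void waitReady(std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
      const CommState state = comm_.poll();
      if (state == CommState::Success) {
        return;
      }
      if (state == CommState::Error) {
        throw std::runtime_error("comm is in an error state");
      }
      if (std::chrono::steady_clock::now() >= deadline) {
        throw std::runtime_error("timed out waiting for comm to become ready");
      }
      // A real implementation would likely sleep or yield between polls.
    }
  }

  MockComm comm_;
};
```

In this sketch the readiness check lives in the accessor itself, so a call that leaves the comm in progress (modeled by `startAsyncOp`) is covered by the next `getComm` without each call site having to remember to wait.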
2 parents 8808319 + 8edd31e commit 03d792c

1 file changed: +7 -0 lines changed

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

Lines changed: 7 additions & 0 deletions
@@ -505,7 +505,14 @@ class TORCH_API ProcessGroupNCCL : public Backend {
  // must be within the numerical range of C++ int. Otherwise, Python will
  // raise a RuntimeError saying type is incompatible. See also
  // `_process_group_color` in `distributed_c10d.py`.
+ #ifdef NCCL_HAS_COMM_SPLIT
  int split_color{NCCL_SPLIT_NOCOLOR - 1};
+ #else
+ // [Note 3]: for older NCCL versions, NCCL_SPLIT_NOCOLOR is not defined. But
+ // `split_color` is pybinded to Python, so we need to define it. So we use
+ // the int value of `NCCL_SPLIT_NOCOLOR` (-1) instead.
+ int split_color{-2};
+ #endif
  std::vector<uint64_t> global_ranks_in_group;
  std::string group_name;
  };
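
The fallback in this hunk relies on both branches producing the same default. A minimal standalone sketch of that pattern is below. `DEMO_HAS_COMM_SPLIT`, `DEMO_SPLIT_NOCOLOR`, and `DemoOptions` are hypothetical stand-ins introduced for illustration, and the value -1 is taken from [Note 3] rather than from an NCCL header.

```cpp
#include <cassert>

// Hypothetical stand-in for the feature macro. Comment this line out to
// model an older library version that lacks the split API.
#define DEMO_HAS_COMM_SPLIT 1

#ifdef DEMO_HAS_COMM_SPLIT
// Assumption: mirrors the -1 that [Note 3] cites for NCCL_SPLIT_NOCOLOR.
#define DEMO_SPLIT_NOCOLOR (-1)
// Guard against the named constant and the hard-coded fallback drifting apart.
static_assert(DEMO_SPLIT_NOCOLOR - 1 == -2,
              "fallback literal must match the macro-derived default");
#endif

struct DemoOptions {
#ifdef DEMO_HAS_COMM_SPLIT
  // Newer library: derive the default from the named constant.
  int split_color{DEMO_SPLIT_NOCOLOR - 1};
#else
  // Older library: the constant is missing, so hard-code the same value.
  int split_color{-2};
#endif
};
```

Either way the member exists and defaults to -2, which is what lets a binding layer expose the field regardless of the library version. The `static_assert` is an addition of this sketch, not part of the commit.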