[RFC] ProcessGroupNCCL uses non-blocking API by default #137007

@kwen2501

🚀 The feature, motivation and pitch

Motivation:

  • Gives better fault tolerance against hangs in comm init, comm destroy, and P2P operations (which involve dynamic connection setup performed by the CPU).

  • Allows overlap between NCCL init and other work the user wants the main thread to do (e.g., model or data loader init).

Here "non-blocking" refers to whether an NCCL API call returns immediately or blocks the host CPU. (Traditionally, some NCCL APIs such as ncclCommInitRank may block for a while when rendezvous is performed.)

If the user wants the main thread to do other work while NCCL is initializing, this mode also helps, since it moves NCCL init into the background.
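To make the calling pattern concrete, here is a minimal sketch of "issue the call, keep working, poll with a deadline". It uses only the C++ standard library and makes no NCCL calls: `FakeComm`, `CommState`, and `wait_ready` are hypothetical names invented for illustration, and the background rendezvous is modeled simply as a point in time after which the status flips to ready. In real NCCL the in-flight state is reported as `ncclInProgress` through `ncclCommGetAsyncError`; how ProcessGroupNCCL would wire that up is not shown here.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Hypothetical async state, loosely mirroring "in progress" vs "done".
enum class CommState { InProgress, Ready };

// Hypothetical toy communicator (not an NCCL type). Construction returns
// immediately; the "rendezvous" is modeled as finishing `init_time` later.
class FakeComm {
 public:
  explicit FakeComm(Ms init_time) : ready_at_(Clock::now() + init_time) {}

  // Non-blocking status query: never waits, only reports the current state.
  CommState state() const {
    return Clock::now() >= ready_at_ ? CommState::Ready
                                     : CommState::InProgress;
  }

 private:
  Clock::time_point ready_at_;
};

// Polls until the communicator is ready or `timeout` elapses.
// Returns true when ready, false on timeout (a suspected hang).
bool wait_ready(const FakeComm& comm, Ms timeout) {
  assert(timeout.count() >= 0);
  const auto deadline = Clock::now() + timeout;
  while (comm.state() != CommState::Ready) {
    if (Clock::now() >= deadline) return false;
    std::this_thread::sleep_for(Ms(1));  // yield instead of spinning
  }
  return true;
}
```

A caller would construct the communicator, run its own setup (model or data loader init) on the main thread, and only then call `wait_ready(comm, timeout)`. A `false` return is the hook for fault handling: because the host thread was never stuck inside the init call, it is still free to report the timeout or abort.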

Alternatives

No response

Additional context

This knob is controlled by PyTorch here:

bool nccl_use_nonblocking() {
  static bool nccl_use_nonblocking_ =
      c10::utils::check_env("TORCH_NCCL_USE_COMM_NONBLOCKING") == true;
  if (nccl_use_nonblocking_) {
    TORCH_WARN_ONCE("Using experimental non-blocking NCCL communicator.");
  }
  return nccl_use_nonblocking_;
}
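The `== true` comparison above matters for the default: only an explicit "on" value enables the feature today. The sketch below illustrates that behavior with a hypothetical stand-in, `read_bool_env`, which assumes the checked helper returns an optional boolean that is true for "1", false for "0", and empty otherwise; that is an assumption about `c10::utils::check_env`, not a copy of it. `flag_enabled` is likewise a made-up name. Unlike the original, the sketch does not cache the result in a function-local static, so it re-reads the environment on every call.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <optional>

// Hypothetical stand-in for the env check: "1" -> true, "0" -> false,
// unset or any other value -> no value. (Assumed semantics.)
std::optional<bool> read_bool_env(const char* name) {
  assert(name != nullptr);
  const char* value = std::getenv(name);
  if (value == nullptr) return std::nullopt;
  if (std::strcmp(value, "1") == 0) return true;
  if (std::strcmp(value, "0") == 0) return false;
  return std::nullopt;
}

// Same shape as the comparison in nccl_use_nonblocking(): an empty optional
// compares unequal to true, so unset and invalid values both mean "off".
bool flag_enabled(const char* name) {
  return read_bool_env(name) == true;
}
```

Under these assumptions, a process launched with `TORCH_NCCL_USE_COMM_NONBLOCKING=1` would opt in, while leaving the variable unset keeps the blocking behavior. Making non-blocking the default, as this RFC proposes, would amount to changing what the unset case resolves to.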

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Labels

  • module: c10d (Issues/PRs related to collective communications and process groups)

  • module: nccl (Problems related to nccl support)

  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
