[RFC] ProcessGroupNCCL uses non-blocking API by default #137007

### 🚀 The feature, motivation and pitch

## Motivation:

- Better fault tolerance against hangs in comm init, comm destroy, and P2P operations (which involve dynamic connection setup performed by the CPU).

- Allows NCCL init to overlap with other work the user wants the main thread to do (e.g., model or data loader init).

Here, "non-blocking" refers to whether an NCCL API call returns immediately or blocks the host CPU. (Traditionally, some NCCL APIs, such as ncclCommInitRank, may block for a while during rendezvous.)

This mode also helps when the user wants the main thread to do other work while NCCL initializes, because it moves NCCL init to the background.
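To make the "return immediately, then poll" idea concrete, here is a minimal C++ sketch of the pattern. It does not call NCCL or PyTorch: `fake_comm_init` and `init_with_deadline` are hypothetical names invented for illustration, and a `std::future` stands in for the communicator state. In NCCL's own non-blocking mode the polling would go through `ncclCommGetAsyncError` instead.

```cpp
#include <chrono>
#include <future>
#include <optional>
#include <thread>

// Hypothetical stand-in for a slow communicator setup (e.g. a rendezvous).
// This is NOT an NCCL or PyTorch API; it only sleeps and returns a status.
int fake_comm_init(std::chrono::milliseconds setup_time) {
    std::this_thread::sleep_for(setup_time);
    return 0;  // 0 means success in this sketch
}

// Start the setup in the background and poll for completion while the calling
// thread keeps doing other work. Returns the status code, or std::nullopt if
// the deadline passes first (a suspected hang).
std::optional<int> init_with_deadline(std::chrono::milliseconds setup_time,
                                      std::chrono::milliseconds deadline,
                                      int& other_work_done) {
    std::future<int> pending =
        std::async(std::launch::async, fake_comm_init, setup_time);
    const auto give_up_at = std::chrono::steady_clock::now() + deadline;

    while (pending.wait_for(std::chrono::milliseconds(1)) !=
           std::future_status::ready) {
        if (std::chrono::steady_clock::now() >= give_up_at) {
            // Caveat: a future from std::async joins in its destructor, so this
            // sketch still waits for the stand-in to finish before returning.
            return std::nullopt;
        }
        ++other_work_done;  // placeholder for model or data loader init
    }
    return pending.get();
}
```

Calling `init_with_deadline` with a setup time shorter than the deadline yields the status code and a non-zero work counter. A setup time longer than the deadline yields `std::nullopt`, which is the point where a caller could report or abort instead of hanging.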

### Alternatives

_No response_

### Additional context

This knob is currently controlled by PyTorch here:
https://github.com/pytorch/pytorch/blob/26956980c6953b42e57ad889b87444c4abccec1c/torch/csrc/distributed/c10d/NCCLUtils.cpp#L90-L97

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

	// Reads TORCH_NCCL_USE_COMM_NONBLOCKING once (static) and warns once when
	// the non-blocking mode is enabled.
	bool nccl_use_nonblocking() {
	  static bool nccl_use_nonblocking_ =
	      c10::utils::check_env("TORCH_NCCL_USE_COMM_NONBLOCKING") == true;
	  if (nccl_use_nonblocking_) {
	    TORCH_WARN_ONCE("Using experimental non-blocking NCCL communicator.");
	  }
	  return nccl_use_nonblocking_;
	}
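For readers without the PyTorch sources at hand, the opt-in check above can be approximated in standalone C++. This is only a sketch: `read_bool_env` is a hypothetical stand-in for `c10::utils::check_env`, assumed here to map "1" to true, "0" to false, and anything else to "no value". Unlike the snippet above, it does not cache the result in a `static`.

```cpp
#include <cstdlib>
#include <cstring>
#include <optional>

// Hypothetical stand-in for c10::utils::check_env. Assumption: "1" means true,
// "0" means false, and an unset variable means "no value".
std::optional<bool> read_bool_env(const char* name) {
    const char* value = std::getenv(name);
    if (value == nullptr) return std::nullopt;
    if (std::strcmp(value, "1") == 0) return true;
    if (std::strcmp(value, "0") == 0) return false;
    return std::nullopt;  // unrecognized values are treated as unset here
}

// Mirrors the "== true" comparison above: an empty optional compares unequal
// to true, so only an explicit "1" enables the non-blocking mode.
bool use_nonblocking_from_env() {
    return read_bool_env("TORCH_NCCL_USE_COMM_NONBLOCKING") == true;
}
```

Under these assumptions, the mode is off when the variable is unset or set to "0", and on only when it is set to "1". Making non-blocking the default, as this RFC proposes, would flip the unset case.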
