TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 is not compatible with torchelastic restarts

### 🐛 Describe the bug

> Hi torch team, we recently upgraded torch from 2.3 to 2.5.0dev0808. After the upgrade, automatic job restarts started failing: during a restart, NCCL initialization fails with the error messages "Software caused connection issue" and "Failed to connect to <ip_addr>:<port>".
> After some debugging, we found the issue is related to sharing the TCP store between torchrun's rendezvous handler and the training process's `init_process_group()`, which became the default behavior in commit bb13fad7aa7754042efe6e9465410cf5e543a77e, titled "Share TCPStore by default when using c10d rdzv handler".
> Because the TCP store is shared, `broadcastNCCLUniqueId` does not seem to work properly: it sometimes gets the unique ID from a previous restart, which results in NCCL init using the wrong port.
> We tried disabling TCP store sharing by setting `TORCH_DISABLE_SHARE_RDZV_TCP_STORE` to 1, which seems to resolve the issue.
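
For anyone who wants to try the same workaround, here is a minimal sketch of launching with the variable set. It assumes the variable has to be visible in the environment of the `torchrun` process itself. The helper name `torchrun_env` and the `train.py` arguments in the comment are hypothetical.

```python
import os


def torchrun_env(base_env=None):
    """Return a copy of the environment with TCPStore sharing disabled."""
    env = dict(os.environ if base_env is None else base_env)
    # "1" disables sharing the rendezvous TCPStore, per the report above.
    env["TORCH_DISABLE_SHARE_RDZV_TCP_STORE"] = "1"
    return env


# Hypothetical launch; adjust the arguments to your own job:
# import subprocess
# subprocess.run(["torchrun", "--nproc-per-node=8", "train.py"], env=torchrun_env())
```

Exporting the variable in the shell before calling `torchrun` should have the same effect.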

The root cause is how the new shared rendezvous store interacts with restart counts. We use `TORCHELASTIC_RESTART_COUNT` to isolate each retry, but this value isn't globally consistent: each worker keeps track of it locally, so if some workers crash and restart, their count gets reset. As a result, different workers end up using different `PrefixStore`s.

https://github.com/pytorch/pytorch/blob/39a61795e3ee41eff4dfe76da14b2535cf47429b/torch/distributed/rendezvous.py#L186
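
To make the failure mode concrete, here is a toy simulation that uses no torch code. `FakeStore` and `FakePrefixStore` are hypothetical in-memory stand-ins for the shared TCPStore and its prefix wrapper, and the `attempt_<n>` prefix format is an assumption for illustration only.

```python
class FakeStore:
    """In-memory stand-in for the long-lived shared TCPStore (hypothetical)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class FakePrefixStore:
    """Namespaces keys under a prefix, loosely mimicking a prefix wrapper."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def set(self, key, value):
        self._store.set(f"{self._prefix}/{key}", value)

    def get(self, key):
        return self._store.get(f"{self._prefix}/{key}")


def store_for_worker(shared, local_restart_count):
    # Assumed prefix format; the real one may differ.
    return FakePrefixStore(f"attempt_{local_restart_count}", shared)


shared = FakeStore()

# First attempt: every agent agrees on restart count 0.
store_for_worker(shared, 0).set("nccl_unique_id", "id-A")

# A failure occurs. Agent 0 survives and bumps its local count to 1.
# Agent 1 crashed and was relaunched, so its local count is back at 0.
rank0_store = store_for_worker(shared, 1)
rank1_store = store_for_worker(shared, 0)

rank0_store.set("nccl_unique_id", "id-B")     # rank 0 publishes the new ID
stale_id = rank1_store.get("nccl_unique_id")  # rank 1 reads under the old prefix

print(stale_id)  # id-A, the ID from the previous attempt
```

Rank 1 ends up with the ID from the previous attempt, which matches the reported symptom of NCCL init connecting to the wrong port.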

There are a few options for fixing this:

1. Disable the shared TCPStore (prior behavior; has issues during shutdown)
2. Add a new global counter that's actually consistent across all hosts (but not all rendezvous implementations support counts)
3. Create a new TCPStore that is managed by elastic on every rendezvous (simplest)

For option 3, we already have the bulk of the code written; we would just need to make it the default case: https://github.com/pytorch/pytorch/blob/18a90309527e2685e0dafc916de2e17c086b679a/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L1192-L1209
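
A rough sketch of why option 3 sidesteps the problem, again as a toy model rather than the actual elastic code. `FakeStore` and `new_store_for_round` are hypothetical names. The sketch assumes every participant of a rendezvous round learns the address of that round's store through the rendezvous itself.

```python
class FakeStore:
    """In-memory stand-in for a TCPStore server (hypothetical)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def has(self, key):
        return key in self._data


def new_store_for_round():
    # Stand-in for elastic creating a fresh store server on every rendezvous.
    return FakeStore()


# Round 0: all workers share the store created for this round.
round0_store = new_store_for_round()
round0_store.set("nccl_unique_id", "id-A")

# After a failure, a new rendezvous produces a brand-new, empty store.
round1_store = new_store_for_round()
leftover = round1_store.has("nccl_unique_id")  # nothing carried over

round1_store.set("nccl_unique_id", "id-B")
fresh_id = round1_store.get("nccl_unique_id")
```

Because the store starts empty each round, no per-worker restart counter is needed to namespace keys, so a reset counter can no longer point a worker at stale data.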


### Versions

nightly

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o

The referenced lines from `dynamic_rendezvous.py`:

    if isinstance(self._store, dist.TCPStore):
        addr = self._store.host
        port = self._store.port
        self._bootstrap_store_info = RendezvousStoreInfo(
            master_addr=addr, master_port=port
        )
        if rank == 0:
            self._shared_tcp_store_server = self._store
    else:
        # If the store is not type of TCPStore start TCPStore server, which requires
        # bootstrapping info across ranks
        self._bootstrap_store_info = RendezvousStoreInfo.build(
            rank, store, local_addr=self._this_node.addr
        )
        if rank == 0:
            self._shared_tcp_store_server = self._create_tcp_store_server(
                self._bootstrap_store_info
            )

TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 is not compatible with torchelastic restarts #135712
