Description
🐛 Describe the bug
Hi torch team, we recently upgraded torch from 2.3 to 2.5.0dev0808. After the upgrade, we encountered issues with automatic job restart. The symptom is that NCCL initialization fails during restart, with error messages such as "Software caused connection issue" and "Failed to connect to <ip_addr>:".
We did some debugging and found that the issue is related to sharing the TCP store between torchrun's rendezvous handler and the training process's init_process_group(). This sharing became the default behavior after commit bb13fad, titled "Share TCPStore by default when using c10d rdzv handler".
It seems that, because the TCP store is shared, broadcastNCCLUniqueId does not work properly: sometimes it gets the unique ID from the previous restart, which results in NCCL's init using the wrong port.
We tried to disable the behavior of sharing TCP store by setting TORCH_DISABLE_SHARE_RDZV_TCP_STORE to 1, which seems to resolve the issue.
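For anyone who wants to try the same workaround, here is a minimal sketch of setting the variable before launching. It assumes the variable has to be present in the environment of the torchrun launcher process (workers inherit it from there). The helper names, the script name `train.py`, and the flag values are hypothetical and only for illustration.

```python
import os
import subprocess


def env_without_shared_store(base_env=None):
    """Return a copy of the environment with TCP store sharing disabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable name taken from this report; "1" disables the shared store.
    env["TORCH_DISABLE_SHARE_RDZV_TCP_STORE"] = "1"
    return env


def launch(cmd, base_env=None):
    """Run a launcher command (e.g. torchrun ...) with the workaround applied."""
    return subprocess.run(cmd, env=env_without_shared_store(base_env), check=False)


# Hypothetical command line; adjust to your own job before calling launch().
EXAMPLE_CMD = [
    "torchrun",
    "--nproc-per-node=2",
    "--max-restarts=3",
    "train.py",
]
```

Calling `launch(EXAMPLE_CMD)` would start the job with the variable set; exporting the variable in the shell before running torchrun should be equivalent.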
The root cause is how the new shared rendezvous implementation handles restart counts. We use TORCHELASTIC_RESTART_COUNT to isolate each retry, but this isn't globally consistent: each worker keeps track of it locally, so if some workers crash and restart, it gets reset. This results in different workers using different PrefixStores.
pytorch/torch/distributed/rendezvous.py
Line 186 in 39a6179
    return PrefixStore(f"/worker/attempt_{attempt}", tcp_store)
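
To make the failure mode concrete, here is a toy model of the prefix mismatch. It does not use torch: `InMemoryStore` and `PrefixView` are hypothetical stand-ins for the shared TCPStore and PrefixStore, and the key name `nccl_unique_id` is made up for illustration. The only thing taken from the real code is the prefix format shown above.

```python
class InMemoryStore:
    """Stand-in for the single shared store that survives restarts."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class PrefixView:
    """Stand-in for a prefixed view: every key is namespaced by a prefix."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def set(self, key, value):
        self._store.set(f"{self._prefix}/{key}", value)

    def get(self, key):
        return self._store.get(f"{self._prefix}/{key}")


def worker_store(shared, local_restart_count):
    # Same prefix format as the line quoted from rendezvous.py.
    return PrefixView(f"/worker/attempt_{local_restart_count}", shared)


shared = InMemoryStore()

# First run: every worker has restart count 0; rank 0 publishes an ID.
worker_store(shared, 0).set("nccl_unique_id", "id-from-first-run")

# Restart: rank 0 now has local count 1 and publishes a fresh ID.
worker_store(shared, 1).set("nccl_unique_id", "id-from-second-run")

# Rank 1 crashed and came back with its local count reset to 0,
# so it looks under the old prefix and finds the stale ID.
seen_by_rank1 = worker_store(shared, 0).get("nccl_unique_id")
print(seen_by_rank1)
```

In this model rank 1 reads the ID left over from the first run, which matches the symptom of NCCL init picking up the unique ID of a previous restart.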
There are a few options for fixing this:
1. disable shared tcp store (prior behavior -- has issues during shutdown)
2. add a new global counter that's actually consistent across all hosts (but not all rendezvous implementations support counts)
3. create a new tcpstore that is managed by elastic on every rendezvous (simplest)
For option 3: we actually already have the bulk of the code written for this -- we would just need to make it be the default case:
pytorch/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py
Lines 1192 to 1209 in 18a9030
    if isinstance(self._store, dist.TCPStore):
        addr = self._store.host
        port = self._store.port
        self._bootstrap_store_info = RendezvousStoreInfo(
            master_addr=addr, master_port=port
        )
        if rank == 0:
            self._shared_tcp_store_server = self._store
    else:
        # If the store is not type of TCPStore start TCPStore server, which requries
        # bootstrapping info across ranks
        self._bootstrap_store_info = RendezvousStoreInfo.build(
            rank, store, local_addr=self._this_node.addr
        )
        if rank == 0:
            self._shared_tcp_store_server = self._create_tcp_store_server(
                self._bootstrap_store_info
            )
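
As a rough illustration of why a fresh store per rendezvous would sidestep the problem, here is a toy sketch. It is not the elastic implementation: `ToyRendezvous` is a hypothetical class, a plain dict stands in for a newly created TCPStore, and the key name is made up. The point is only that workers get the store from the rendezvous round itself instead of deriving a prefix from a local counter.

```python
class ToyRendezvous:
    """Hands out a brand-new store each time a rendezvous round completes."""

    def __init__(self):
        self.round = 0
        self.store = {}

    def next_round(self):
        self.round += 1
        # A new store per round: nothing written in earlier rounds is visible.
        self.store = {}
        return self.store


rdzv = ToyRendezvous()

# First run: rank 0 publishes an ID in the store for round 1.
store = rdzv.next_round()
store["nccl_unique_id"] = "id-from-first-run"

# Restart: all workers re-rendezvous and receive the same fresh store,
# regardless of what each worker's local restart count says.
store = rdzv.next_round()
stale = store.get("nccl_unique_id")  # nothing left over from round 1
store["nccl_unique_id"] = "id-from-second-run"
print(stale, store["nccl_unique_id"])
```

Because the store itself is replaced, a worker with a reset local counter cannot read keys from a previous attempt in this model.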
Versions
nightly
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o