TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 is not compatible with torchelastic restarts #135712

@d4l3k

🐛 Describe the bug

Hi torch team, we recently upgraded our torch from 2.3 to 2.5.0dev0808. After the upgrade, we encountered issues with automatic job restarts. The symptom is that during a restart, NCCL initialization fails with error messages such as "Software caused connection issue" and "Failed to connect to <ip_addr>:".
We did some debugging and found the issue to be related to the sharing of the TCP store between torchrun's rendezvous handler and the training process's init_process_group(), which became the default behavior after commit bb13fad, titled "Share TCPStore by default when using c10d rdzv handler".
It seems that because the TCP store is shared, broadcastNCCLUniqueId does not work properly: it sometimes gets the unique ID from the previous restart, which results in the wrong port being used by NCCL's init.
We tried to disable the behavior of sharing TCP store by setting TORCH_DISABLE_SHARE_RDZV_TCP_STORE to 1, which seems to resolve the issue.
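A minimal sketch of applying that workaround from a Python launcher script. The helper name `build_launch_env` is hypothetical, and the sketch assumes the variable only needs to be present in the environment that the launcher and its workers inherit:

```python
import os


def build_launch_env(base_env=None):
    """Return a copy of the environment with TCPStore sharing turned off.

    Hypothetical helper: it only prepares the environment and does not
    launch anything itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # "1" disables sharing the rendezvous TCPStore with the training process.
    env["TORCH_DISABLE_SHARE_RDZV_TCP_STORE"] = "1"
    return env


if __name__ == "__main__":
    env = build_launch_env()
    print(env["TORCH_DISABLE_SHARE_RDZV_TCP_STORE"])
```

The returned dict can be passed as `env=` to `subprocess.run` when starting torchrun, or the variable can simply be exported in the shell before launching.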

The root cause is how the new shared rendezvous implementation handles restart counts. We use TORCHELASTIC_RESTART_COUNT to isolate each retry, but this isn't globally consistent -- each worker keeps track of it locally, so if some workers crash and restart, it gets reset. This results in different workers using different PrefixStores.

    return PrefixStore(f"/worker/attempt_{attempt}", tcp_store)
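To make the failure mode concrete, here is a toy model of it. `ToyStore` and `ToyPrefixStore` are hypothetical in-memory stand-ins, not torch's `TCPStore` or `PrefixStore`, and the key name `nccl_unique_id` is made up for illustration. The sketch assumes one agent survives a failure and counts restart 1, while a replaced agent starts again from 0:

```python
class ToyStore:
    """In-memory stand-in for a shared key-value store (not torch's TCPStore)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]


class ToyPrefixStore:
    """Namespaces every key under a prefix, like a prefix-store wrapper would."""

    def __init__(self, prefix, store):
        self.prefix = prefix
        self.store = store

    def set(self, key, value):
        self.store.set(f"{self.prefix}/{key}", value)

    def get(self, key):
        return self.store.get(f"{self.prefix}/{key}")


def worker_store(attempt, shared):
    # Same prefix scheme as the line quoted above.
    return ToyPrefixStore(f"/worker/attempt_{attempt}", shared)


shared = ToyStore()  # one store that outlives every restart

# First run: every agent has a local restart count of 0.
worker_store(0, shared).set("nccl_unique_id", "id-from-first-run")

# After a failure, the surviving agent counts restart 1 and rank 0
# publishes a new ID under that prefix.
worker_store(1, shared).set("nccl_unique_id", "id-from-second-run")

# A replaced agent starts counting from 0 again and reads the old prefix.
stale = worker_store(0, shared).get("nccl_unique_id")
fresh = worker_store(1, shared).get("nccl_unique_id")
print(stale, fresh)
```

Because the shared store is never cleared, the worker with the reset count does not block or fail on a missing key. It silently reads the ID left over from the previous run, which matches the symptom described in the report.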

There's a couple of options for fixing this:

  1. disable shared tcp store (prior behavior -- has issues during shutdown)
  2. add a new global counter that's actually consistent across all hosts (but not all rendezvous implementations support counts)
  3. create a new tcpstore that is managed by elastic on every rendezvous (simplest)
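The idea behind option 3 can be sketched with a toy model. Plain dicts stand in for stores here, and `new_round_store`, `run_round`, and the key name are hypothetical, not elastic's API. The assumption is that each rendezvous round hands every worker a store created for that round only:

```python
def new_round_store():
    """Hypothetical stand-in for creating a fresh store on each rendezvous."""
    return {}


def run_round(store, local_restart_counts, unique_id):
    """Rank 0 publishes an ID, then every worker reads it back.

    The workers' local restart counts are deliberately ignored: with a
    fresh store per round there is nothing stale to isolate from, so no
    attempt-based prefix is needed.
    """
    store["nccl_unique_id"] = unique_id
    return [store["nccl_unique_id"] for _ in local_restart_counts]


# Round 1: all agents agree on restart count 0.
round1 = new_round_store()
ids_round1 = run_round(round1, [0, 0], "id-from-first-run")

# Round 2: counts disagree (1 vs. 0), but the store is new, so every
# worker still reads the same, current ID.
round2 = new_round_store()
ids_round2 = run_round(round2, [1, 0], "id-from-second-run")
print(ids_round1, ids_round2)
```

In this model, isolation comes from the lifetime of the store rather than from a counter, so it does not depend on the workers agreeing on how many restarts have happened.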

For option 3: we actually already have the bulk of the code written for this -- we would just need to make it be the default case:

    if isinstance(self._store, dist.TCPStore):
        addr = self._store.host
        port = self._store.port
        self._bootstrap_store_info = RendezvousStoreInfo(
            master_addr=addr, master_port=port
        )
        if rank == 0:
            self._shared_tcp_store_server = self._store
    else:
        # If the store is not a TCPStore, start a TCPStore server, which requires
        # bootstrapping info across ranks
        self._bootstrap_store_info = RendezvousStoreInfo.build(
            rank, store, local_addr=self._this_node.addr
        )
        if rank == 0:
            self._shared_tcp_store_server = self._create_tcp_store_server(
                self._bootstrap_store_info
            )

Versions

nightly

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o
