[C10D] race at shutdown, store is gone when heartbeat monitor calls store->check #123969

@wconstab

Description


Repro is torchtrain with:
TORCH_CPP_LOG_LEVEL=INFO TORCH_NCCL_ABORT_IN_DESTROY_PG=1 CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3 ./run_llama_train.sh --checkpoint.folder ./test_runner_checkpoint_full_checkpoint

[rank2]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xa1 (0x7f783390b2e1 in /data/users/whc/pytorch/torch/lib/libc10.so)                             
[rank2]:frame #1: <unknown function> + 0x58e288c (0x7f78184e288c in /data/users/whc/pytorch/torch/lib/libtorch_cpu.so)                                                                                                                       
[rank2]:frame #2: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const
&) + 0x223 (0x7f78184dbfb3 in /data/users/whc/pytorch/torch/lib/libtorch_cpu.so)                                                                                                                                                             
[rank2]:frame #3: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x3a2 (0x7f781cea8ab2 in /data/users/whc/pytorch/torch/lib/libtorch_cuda.so)                                                                                                  
[rank2]:frame #4: <unknown function> + 0xd3e95 (0x7f782b4f0e95 in /home/whc/.conda/envs/pytorch-3.10/lib/libstdc++.so.6)                                                                                                                     
[rank2]:frame #5: <unknown function> + 0x89c02 (0x7f7839089c02 in /lib64/libc.so.6)                                                                                                                                                          
[rank2]:frame #6: <unknown function> + 0x10ec40 (0x7f783910ec40 in /lib64/libc.so.6)                                                                                                                                                         
[rank2]:                                                                                                                                                                                                                                     
[rank2]:Fatal Python error: Aborted                                                                                                                                                                                                          
[rank2]:                                                                                                                                                                                                                                     
[rank2]:Thread 0x00007f778fe00640 (most recent call first):                                                                                                                                                                                  
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 324 in wait                                                                                                                                            
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 607 in wait                     
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run                                                                                                                            
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner                                                                                                                               
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 973 in _bootstrap                                                                                                                                      
[rank2]:                                                                                                                                                                                                                                     
[rank2]:Thread 0x00007f7839237480 (most recent call first):                                                                                                                                                                                  
[rank2]:  File "/data/users/whc/pytorch/torch/distributed/distributed_c10d.py", line 1404 in _shutdown_backend        

In my experience it happens more often with TORCH_NCCL_ABORT_IN_DESTROY_PG=0, but it happens either way.

Logically, it's clear we have a race: rank 0 (which hosts the TCPStore) is free to exit and shut down the TCPStore before or after the other ranks shut themselves down. If it goes first, a still-running heartbeat monitor on another rank calls store->check() against a store that no longer exists.

We have to consider either:
a) rank 0 waits for all other ranks to exit before it exits, or
b) we make the other ranks tolerant of the TCPStore going away.

(A) has potential efficiency downsides even in the best case. Practically, we could implement it with a TCPStore barrier, but that won't work reliably at large scale; we'd have to consider something like a 'scalable coordinator' to go this route.

(B) we could probably implement with a try/catch in the heartbeat monitor for the connection error, similar to how we catch CUDA driver exit. This feels a little hacky. It's important that we somehow distinguish the case where we are actually shutting down and want to swallow the error from the case where there is a TCPStore/network error during runtime and we actually want to raise it. This points to (a) as the more correct direction.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang

Labels

oncall: distributed
