[C10D] race at shutdown, store is gone when heartbeat monitor calls store->check #123969

@wconstab

Description


Repro is torchtrain with:
TORCH_CPP_LOG_LEVEL=INFO TORCH_NCCL_ABORT_IN_DESTROY_PG=1 CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3 ./run_llama_train.sh --checkpoint.folder ./test_runner_checkpoint_full_checkpoint

[rank2]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xa1 (0x7f783390b2e1 in /data/users/whc/pytorch/torch/lib/libc10.so)                             
[rank2]:frame #1: <unknown function> + 0x58e288c (0x7f78184e288c in /data/users/whc/pytorch/torch/lib/libtorch_cpu.so)                                                                                                                       
[rank2]:frame #2: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const
&) + 0x223 (0x7f78184dbfb3 in /data/users/whc/pytorch/torch/lib/libtorch_cpu.so)                                                                                                                                                             
[rank2]:frame #3: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x3a2 (0x7f781cea8ab2 in /data/users/whc/pytorch/torch/lib/libtorch_cuda.so)                                                                                                  
[rank2]:frame #4: <unknown function> + 0xd3e95 (0x7f782b4f0e95 in /home/whc/.conda/envs/pytorch-3.10/lib/libstdc++.so.6)                                                                                                                     
[rank2]:frame #5: <unknown function> + 0x89c02 (0x7f7839089c02 in /lib64/libc.so.6)                                                                                                                                                          
[rank2]:frame #6: <unknown function> + 0x10ec40 (0x7f783910ec40 in /lib64/libc.so.6)                                                                                                                                                         
[rank2]:                                                                                                                                                                                                                                     
[rank2]:Fatal Python error: Aborted                                                                                                                                                                                                          
[rank2]:                                                                                                                                                                                                                                     
[rank2]:Thread 0x00007f778fe00640 (most recent call first):                                                                                                                                                                                  
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 324 in wait                                                                                                                                            
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 607 in wait                     
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run                                                                                                                            
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner                                                                                                                               
[rank2]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 973 in _bootstrap                                                                                                                                      
[rank2]:                                                                                                                                                                                                                                     
[rank2]:Thread 0x00007f7839237480 (most recent call first):                                                                                                                                                                                  
[rank2]:  File "/data/users/whc/pytorch/torch/distributed/distributed_c10d.py", line 1404 in _shutdown_backend        

In my experience it happens more often with TORCH_NCCL_ABORT_IN_DESTROY_PG=0, but it happens either way.

Logically, it's clear we have a race: rank 0 (which hosts the TCPStore) is free to exit and shut down the TCPStore before or after the other ranks shut themselves down. If it goes first, a still-running heartbeat monitor on another rank calls store->check() against a store that no longer exists.

We have to consider either:
a) rank 0 waits for all other ranks to exit before it exits, or
b) we make the other ranks tolerant of the TCPStore going away.

(A) has potential efficiency downsides even in the best case. Practically, we could implement it with a TCPStore barrier, but that won't work reliably at large scale; we'd have to consider something like a 'scalable coordinator' to go this route.

(B) we could probably implement with a try/catch in the heartbeat monitor for the connection error, similar to how we catch CUDA driver exit. This feels a little hacky. It's important that we somehow distinguish the case where we are actually shutting down and want to swallow the error from the case where there is a TCPStore/network error during runtime and we actually want to raise it. This points to (a) as the more correct direction.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang

Labels

oncall: distributed
