🐛 Bug
Possible root cause for #45435. CC @walterddr
Thanks to @jaglinux for the following triage information.
For the barrier call, the allreduce uses a tensor on a CUDA device chosen by the following formula:
pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp, line 1413 (at a49367e):

```cpp
int16_t deviceIdx = static_cast<int16_t>(rank_ % numGPUs);
```
But when the tests are called, they use a different formula for the rank-to-GPU selection:
pytorch/torch/testing/_internal/distributed/distributed_test.py, lines 367 to 374 (at a49367e):

```python
nGPUs_per_process = nGPUs // world_size
rank_to_GPU = {
    i: list(
        visible_devices[i * nGPUs_per_process: (i + 1) * nGPUs_per_process]
    )
    for i in range(world_size)
}
return rank_to_GPU
```
Hence, for ranks 0, 1, 2, barrier uses tensors on cuda:0, cuda:1, cuda:2, while the tests use tensors on cuda:0, cuda:2, cuda:4.
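To make the mismatch concrete, here is a small self-contained sketch (world_size = 3 and 6 visible GPUs are assumed values, picked only for illustration) that evaluates both mappings side by side:

```python
world_size = 3
num_gpus = 6
visible_devices = list(range(num_gpus))

# Device index picked by the barrier's allreduce in ProcessGroupNCCL.cpp:
# rank_ % numGPUs.
barrier_device = {rank: rank % num_gpus for rank in range(world_size)}

# Devices picked by the tests via rank_to_GPU in distributed_test.py:
# a contiguous per-rank slice of size nGPUs // world_size.
n_gpus_per_process = num_gpus // world_size
rank_to_gpu = {
    i: visible_devices[i * n_gpus_per_process:(i + 1) * n_gpus_per_process]
    for i in range(world_size)
}

for rank in range(world_size):
    print(f"rank {rank}: barrier uses cuda:{barrier_device[rank]}, "
          f"tests use cuda:{rank_to_gpu[rank][0]}")
# rank 0: barrier uses cuda:0, tests use cuda:0
# rank 1: barrier uses cuda:1, tests use cuda:2
# rank 2: barrier uses cuda:2, tests use cuda:4
```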
If we change the rank_to_GPU data structure in distributed_test.py so that each rank maps to rank % numGPUs, all the tests pass.
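A minimal sketch of what that change could look like (the helper name rank_to_gpu_map is hypothetical, and the layout of the remaining GPUs in each rank's list is my assumption; only the first device per rank is dictated by the rank % numGPUs requirement):

```python
def rank_to_gpu_map(world_size, nGPUs):
    # Hedged sketch, not the actual patch: the first GPU in each rank's list is
    # rank % nGPUs, matching the device the barrier's allreduce selects; the
    # remaining devices are interleaved, which is an assumption for illustration.
    nGPUs_per_process = nGPUs // world_size
    return {
        i: [(i + j * world_size) % nGPUs for j in range(nGPUs_per_process)]
        for i in range(world_size)
    }

# With world_size=3 and 6 GPUs: {0: [0, 3], 1: [1, 4], 2: [2, 5]}
print(rank_to_gpu_map(3, 6))
```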
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd