Show a warning if current device index is lower than current local rank #1308
Description
🚀 Feature
Following #1307, if the user does not call `torch.cuda.set_device("cuda:lrank")`, ignite's code
ignite/ignite/distributed/comp_models/native.py, lines 99 to 102 in 0c41778:

```python
def _compute_nproc_per_node(self):
    tensor = torch.tensor([self.get_local_rank() + 1]).to(self.device())
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    return tensor.item()
```

will use the same device `cuda:0` for the all_reduce op.
For older NCCL versions, NCCL will set itself up such that the i-th process uses the `cuda:0` device, and thus the following collective op will hang across devices. For example:
```python
import os

import torch
import torch.distributed as dist


def main():
    # !!! We do not call torch.cuda.set_device("cuda:lrank")
    dist.init_process_group(backend="nccl", init_method="env://")

    local_rank = int(os.environ["LOCAL_RANK"])
    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    tensor = torch.tensor([local_rank + 1]).to("cuda:{}".format(local_rank))
    # PROGRAM WILL HANG HERE >>>>
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

For newer NCCL versions, it raises the error described in #1307.
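For reference, the hang goes away if each process pins itself to its own device before the first collective op. A minimal sketch (to be launched via `torchrun`/`torch.distributed.launch`, not run standalone):

```python
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    # Pin this process to its own GPU *before* any collective op,
    # so "cuda" resolves to cuda:<local_rank> instead of cuda:0.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl", init_method="env://")
    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)  # no hang now
    print(tensor)
    dist.destroy_process_group()
```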
Let's improve the code by raising a warning in the native and horovod dist models when `idist.device()` is called and the current cuda device index is smaller than the local rank.
The PyTorch docs suggest using 1 process per 1 cuda device, so the local rank should equal the cuda device index.
However, it is also possible to have M processes with K devices per process (e.g. 4 processes with 2 GPUs per process), in which case local rank <= cuda device index.