Show a warning if current device index is lower than current local rank #1308

@vfdev-5

Description

🚀 Feature

Following #1307, if the user does not call torch.cuda.set_device("cuda:lrank"), ignite's code

```python
def _compute_nproc_per_node(self):
    tensor = torch.tensor([self.get_local_rank() + 1]).to(self.device())
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    return tensor.item()
```

will use the same device, cuda:0, for the all_reduce op in every process.
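This happens because, until torch.cuda.set_device is called, each process's current CUDA device defaults to index 0, and a bare "cuda" device spec carries no index, so it resolves to the current device. A quick illustration (assumes a torch install; the device-object behavior holds even without a GPU):

```python
import torch

# A bare "cuda" device spec has no index: it means "the current device",
# which defaults to 0 until torch.cuda.set_device(...) is called.
dev = torch.device("cuda")
print(dev.index)  # None -> resolves to the current device (0 by default)

if torch.cuda.is_available():
    # On a CUDA machine, the default current device is index 0 in every process.
    print(torch.cuda.current_device())
```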

With older NCCL versions, NCCL silently sets itself up so that the i-th process uses the cuda:0 device, and a subsequent collective op on any other device hangs. For example:

```python
import os

import torch
import torch.distributed as dist


def main():

    # !!! We do not call torch.cuda.set_device("cuda:lrank")

    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])

    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    tensor = torch.tensor([local_rank + 1]).to("cuda:{}".format(local_rank))
    # PROGRAM WILL HANG HERE >>>>
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With newer NCCL versions, it raises an error instead, as described in #1307.
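For completeness, a corrected version of the repro (a sketch, not code from this issue): pinning the device before any collective removes the hang. The LOCAL_RANK guard just keeps the script inert outside a torch.distributed.launch / torchrun environment.

```python
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])

    # Pin this process to its own GPU *before* the first collective,
    # so that a bare "cuda" resolves to cuda:<local_rank> instead of cuda:0.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl", init_method="env://")

    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```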

Let's improve the code by raising a warning in the native and Horovod distributed models when idist.device() is called and the current CUDA device index is smaller than the local rank.
The PyTorch docs suggest using 1 process per CUDA device, in which case the local rank should equal the CUDA device index.
However, it is also possible to run M processes with K devices per process (e.g. 4 processes with 2 GPUs each), in which case local rank <= CUDA device index.
