Show a warning if current device index is lower than current local rank #1308

@vfdev-5

Description

🚀 Feature

Following #1307, if the user does not call torch.cuda.set_device("cuda:lrank"), ignite's code

```python
def _compute_nproc_per_node(self):
    tensor = torch.tensor([self.get_local_rank() + 1]).to(self.device())
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    return tensor.item()
```

will use the same device, cuda:0, for the all_reduce op in every process.
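This happens because, until torch.cuda.set_device is called, each process's current CUDA device defaults to index 0, and a bare "cuda" device spec carries no index, so it resolves to the current device. A quick illustration (assumes a torch install; the device-object behavior holds even without a GPU):

```python
import torch

# A bare "cuda" device spec has no index: it means "the current device",
# which defaults to 0 until torch.cuda.set_device(...) is called.
dev = torch.device("cuda")
print(dev.index)  # None -> resolves to the current device (0 by default)

if torch.cuda.is_available():
    # On a CUDA machine, the default current device is index 0 in every process.
    print(torch.cuda.current_device())
```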

With older NCCL versions, NCCL silently sets itself up so that the i-th process uses the cuda:0 device, and a subsequent collective op on any other device hangs. For example:

```python
import os

import torch
import torch.distributed as dist


def main():

    # !!! We do not call torch.cuda.set_device("cuda:lrank")

    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])

    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    tensor = torch.tensor([local_rank + 1]).to("cuda:{}".format(local_rank))
    # PROGRAM WILL HANG HERE >>>>
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With newer NCCL versions, it raises an error instead, as described in #1307.
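For completeness, a corrected version of the repro (a sketch, not code from this issue): pinning the device before any collective removes the hang. The LOCAL_RANK guard just keeps the script inert outside a torch.distributed.launch / torchrun environment.

```python
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])

    # Pin this process to its own GPU *before* the first collective,
    # so that a bare "cuda" resolves to cuda:<local_rank> instead of cuda:0.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl", init_method="env://")

    tensor = torch.tensor([local_rank + 1]).to("cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    print(tensor)

    dist.destroy_process_group()


if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```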

Let's improve the code by raising a warning in the native and Horovod distributed models when idist.device() is called and the current CUDA device index is smaller than the local rank.
The PyTorch docs suggest using 1 process per CUDA device, in which case the local rank should equal the CUDA device index.
However, it is also possible to run M processes with K devices per process (e.g. 4 processes with 2 GPUs each), in which case local rank <= CUDA device index.
